Foreword


The Japan Science and Technology Agency (JST) is an independent public body
of the Ministry of Education, Culture, Sports, Science and Technology (MEXT). JST
plays a key role in implementing science and technology policies formulated in line
with the nation’s Science and Technology Basic Plan. The Basic Research Programs
at JST focus on fundamental research areas that help develop technological
breakthroughs, which in turn lead to the advancement of S&T and the creation of new
industries. The programs also encourage research that triggers, through innovation,
the reform of social and economic structures. The Core Research for Evolutionary
Science and Technology (CREST) program is one of the Basic Research Programs at
JST. With the aim of promoting and encouraging the development of breakthrough
technologies that contribute to the attainment of the country’s strategic objectives, JST
provides a variety of funding programs for promising research projects.
CREST is one of JST’s major undertakings for stimulating achievement in
fundamental science fields. In addition, returning the fruits of such research to society
through innovation is another important responsibility of JST.

The “Advanced Core Technologies for Big Data Integration” study area aims at
the creation, advancement, and systematization of next-generation core technologies
that solve essential issues common to a number of data domains and enable
integrated analysis of big data in a variety of fields. Specific development targets
include technology for the stable operation of large-scale data management systems
that compress, transfer, and store big data; technology for efficiently retrieving truly
necessary knowledge by means of search, comparison, and visualization across
diverse information; and the mathematical methods and algorithms enabling such
services. In pursuing these studies, with a view to overall system design up to the
creation of value for society from big data, the creation, advancement, and
systematization of next-generation common core technology that is highly acceptable to the
public will be undertaken through active efforts at fusion with fields outside of
information and communication technology. There are 11 projects in total. Among them,
“The Security Infrastructure Technology for Integrated Utilization of Big Data,” led by
Atsuko Miyaji (Research Director), focuses on secure and well-balanced utilization of
big data. Much existing security research focuses on “fast encrypted calculation”
technologies because it targets statistical computations such as sums and averages.
However, big data are varied, and so are their uses. Statistical computations such as
sums and averages are not enough; medical image data, picture data, and the like call
for more than statistics. What should the security infrastructure be for the utilization
of such a wide variety of big data? In addition, extremely secure technologies may
often benefit neither the data owner nor the data user. Her project builds technology
that realizes balanced security and utilization of big data from the viewpoint of three
parties: the data owner, the analyst, and the user. This technology can be combined with
fast encrypted calculation, which is a typical target of existing cryptographic
research. We sincerely hope that this concept of security infrastructure technology
for the utilization of big data will open up the world of big data utilization in
various fields such as the medical and living safety fields.


January 2020 Prof. Masaru Kitsuregawa 
The University of Tokyo 
Tokyo, Japan 


Preface 


The project “The Security Infrastructure Technology for Integrated Utilization of
Big Data” started in October 2014. Our team consists of four groups: the security
primitive group under the guidance of Atsuko Miyaji at Osaka University, the security
management primitive group under Kiyomoto at KDDI Laboratory, the living safety
field group under Kitamura at AIST and Nishida at Tokyo Institute of Technology, and the
medical field group under Tanaka at the National Cancer Center and Yamamoto at MEDIS.
Concretely, Kiyomoto and Miyaji have investigated the security infrastructure
necessary for the utilization of big data. Based on this security infrastructure,
Kitamura and Nishida built testbed systems in the living safety field, and Tanaka and
Yamamoto built testbed systems in the medical field. All studies combined aim to
ensure that the security infrastructure works well in the real world. Furthermore,
after Kitamura and Nishida integrate the necessary big data, excluding
privacy information, using our security infrastructure, they will analyze why serious
injuries occur at elementary schools. Meanwhile, Tanaka and Yamamoto have
built an open medical network using our security infrastructure, which enables
patients to check the usage of their medical records distributed across different hospitals.

One feature of our project is that it builds security infrastructure for big
data utilization based not on the interests of security researchers but on issues raised
by the living safety and medical fields, which actually use big data. In other words, an
important feature is that the required specifications do not deviate from actual problems. In
addition, we report the results of actual research in both fields using the security
infrastructure constructed according to their requirements. Until our security
infrastructure was realized, analysis had been performed only on data that were available
and acceptable from the point of view of privacy policy. Furthermore, the
evaluation or analysis of security primitives is often based on dummy data,
whereas our security primitives have been evaluated by researchers who actually
use big data. We also clarify how to introduce such security solutions into the
living safety and medical fields and provide guidance on how to use the
security infrastructure. We hope that this book will be used by companies, schools,
and public organizations that are considering using big data.
Chapter 1
Introduction


Atsuko Miyaji, Shinsaku Kiyomoto, Katsuya Tanaka, Yoshifumi Nishida, 
and Koji Kitamura 


1.1 Purpose of Miyaji-CREST 


Recently, the results of big data analysis are expected to be used in various settings,
such as the medical and industrial fields, for new medicine or product development. For this
reason, it is important to establish a secure infrastructure for the collection, analysis,
and use of big data. We need to consider mainly three entities for the infrastructure:
data owners, analysis institutions, and users. This research pays attention to the balance
between privacy and utilization and also realizes appropriate return and feedback
of the data analysis results to the data owners.

To build a secure big data infrastructure that connects data owners, analysis
institutions, and user institutions in a circle of trust, we construct the security
technologies necessary for big data utilization. Our main security technologies are
oblivious RAM (ORAM), private set intersection (PSI), privacy-preserving classification,
privacy configuration support, privacy risk assessment, and traceability. Furthermore,
we consider the robustness against various attacks such as cyber attacks, as well as
post-quantum security.

Fig. 1.1 Overview of security infrastructure from data collections to utilization

We construct a safe and privacy-preserving big data distribution platform that
realizes the collection, analysis, and utilization of big data, and the return of the
results to the owners of the data, in a secure and fair manner.

In addition, we demonstrate our secure big data infrastructure in the medical and
living safety fields. Figure 1.1 shows an overview of our research.


1.2 Roles of Each Group 


1.2.1 Security Core Group 


We constructed security primitives in the following fields with the aim of realizing an
infrastructure for big data utilization that conducts the collection, analysis, and
utilization of big data securely:

1. Analysis of security basis: Any security primitive used in an infrastructure for
big data utilization is based on cryptographic algorithms. That is, a security primitive
becomes compromised if the underlying cryptographic algorithm is attacked. Therefore,
security analysis of cryptographic primitives is important. In this research, we focus
on elliptic curve cryptosystems, which achieve compact public-key cryptosystems, and
learning with errors (LWE)-based cryptosystems, which are a type of post-quantum
cryptosystem.
2. Privacy-preserving data integration among databases distributed in different
organizations: This primitive integrates the data common to databases kept in different
organizations while keeping the remaining data of each organization secret from the
other organizations.
3. Privacy-preserving classification: This primitive applies the server’s classification
rule to the client’s input database and outputs only the result to the client, while
keeping the client’s input database secret from the server and the server’s
classification rule secret from the client.




1.2.2 Security Management Group 


Our group focuses on research on data anonymization techniques. First, we analyze 
the existing anonymization techniques and adversary models for the techniques and 
clarify our research motivation. Then, we propose our adversary model applicable to 
several anonymization methods and propose a novel privacy risk analysis method. 
An implementation of our data anonymization tool based on the risk analysis method 
is introduced in the chapter. 


1.2.3 Living Safety Testbed Group 


The Living Safety Group develops new technologies for injury prevention in daily
environments, such as school safety and home safety, based on the
security platform developed by the Security Core Group and the Security Management
Group. This group has devoted itself not only to developing technology for
handling big data related to injury but also to empowering practitioners through
social implementation that utilizes the developed technologies in cooperation with
multiple stakeholders.


1.2.4 Health Testbed Group 


The Health Testbed Group focuses on implementing a secure clinical data collection
and analysis infrastructure for clinical research using the cloud by applying the
security primitives developed by the Security Core Group and the Security Management
Group. For the development of our health testbed, this group is working on the
standardization of data storage, cross-institutional collection, and analysis of
electronic medical record data, a management mechanism for patient consent
information, and traceability for the secondary use of medical data.
Chapter 2
Cryptography Core Technology


Chen-Mou Cheng, Kenta Kodera, Atsuko Miyaji, and Shinya Okumura 


Abstract In this chapter, we describe the analysis of the security basis. One part is the
analysis of the elliptic curve discrete logarithm problem (ECDLP). The ECDLP underlies
public-key cryptosystems that can achieve short key sizes, but these cryptosystems are not
post-quantum. The other part is the analysis of learning with errors (LWE), which underlies
post-quantum cryptosystems and offers the functionality of homomorphic encryption. These
two security bases have important roles in each protocol described in Sect. 2.2.4.2


2.1 Analysis on ECDLP 


2.1.1 Introduction 


In recent years, elliptic curve cryptography has been gaining momentum in deployment
because it can achieve the same level of security as RSA using much shorter keys
and ciphertexts. The security of elliptic curve cryptography is closely related to the
computational complexity of the elliptic curve discrete logarithm problem (ECDLP).
Let $p$ be a prime number and $E$ a nonsingular elliptic curve over $\mathbb{F}_{p^n}$, the finite
field of $p^n$ elements. That is, $E$ is a plane algebraic curve defined by the equation
$y^2 = x^3 + ax + b$ for $a, b \in \mathbb{F}_{p^n}$ such that $\Delta = -16(4a^3 + 27b^2) \neq 0$. Along with
a point $O$ at infinity, the set of rational points $E(\mathbb{F}_{p^n})$ forms an abelian group with $O$
as the identity. Given $P \in E(\mathbb{F}_{p^n})$ and $Q$ in the subgroup generated by $P$, ECDLP
is the problem of finding an integer $\alpha$ such that $Q = \alpha P$.
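
To make the problem statement concrete, the following minimal sketch sets up a toy ECDLP instance and solves it by exhaustive search. The curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$, the base point $(3, 6)$, and all function names are assumptions chosen for illustration; they do not come from the text, and the sketch relies on Python 3.8+ for modular inverses via `pow(x, -1, p)`.

```python
# Toy ECDLP instance: brute-force search for alpha with Q = alpha * P.
# Curve, prime, and base point are illustrative assumptions, not from the text.
p, a, b = 97, 2, 3            # y^2 = x^3 + 2x + 3 over F_97
O = None                      # point at infinity (group identity)

def add(P, Q):
    """Chord-and-tangent addition on the short Weierstrass curve."""
    if P is O:
        return Q
    if Q is O:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return O              # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def mul(k, P):
    """Double-and-add scalar multiplication k * P."""
    R = O
    while k:
        if k & 1:
            R = add(R, P)
        P = add(P, P)
        k >>= 1
    return R

def ecdlp_bruteforce(P, Q):
    """Smallest alpha >= 0 with alpha * P == Q, or None if Q is not in <P>."""
    R, alpha = O, 0
    while True:
        if R == Q:
            return alpha
        R = add(R, P)
        alpha += 1
        if R is O:
            return None       # walked through the whole subgroup

P = (3, 6)
Q = mul(3, P)
alpha = ecdlp_bruteforce(P, Q)
```

The search takes time linear in the order of $P$, which is why it is only viable for toy parameters; the generic and index-calculus attacks discussed in this section aim to do better.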
Today, the best practical attacks against ECDLP are exponential-time, generic
discrete logarithm algorithms such as Pollard’s rho method [34]. However, recently, a
line of research has been dedicated to index calculus for ECDLP, which was started
by Semaev, Gaudry, and Diem [25, 30, 35]. Under certain heuristic assumptions, such
algorithms could lead to subexponential attacks on ECDLP in some cases [27, 31,
33]. The interested reader is referred to a survey paper by Galbraith and Gaudry for
a more comprehensive and in-depth account of the recent development of ECDLP
algorithms along various directions [28].

In this section, we investigate the computational complexity of ECDLP for elliptic
curves in various forms, including Hessian [36], Montgomery [32], (twisted)
Edwards [23, 24], and Weierstrass, using index calculus. Recently, elliptic curves of
various forms such as Curve25519 [22] have been drawing considerable attention
in deployment, partly because some of them allow fast implementation and
security against timing-based side-channel attacks. Furthermore, we can construct these
curves not only over prime fields (such as the field of $2^{255} - 19$ elements as used in
Curve25519) but also over extension fields. In this section, we will focus on curves
over optimal extension fields (OEFs) [21]. An OEF is an extension field of a prime
field $\mathbb{F}_p$ with $p$ close to $2^8$, $2^{16}$, $2^{32}$, $2^{64}$, etc. Such primes fit nicely into the processor
words of 8-, 16-, 32-, or 64-bit microprocessors and hence are particularly suitable
for software implementation, allowing efficient utilization of fast integer arithmetic
on modern microprocessors [21]. As we will see, our experimental results show
significant differences in the computational complexity of ECDLP for
elliptic curves in various forms over OEFs.


2.1.2 Previous Works 


2.1.2.1 Index Calculus for ECDLP 


Let $E$ be an elliptic curve defined over a finite field $\mathbb{F}_{p^n}$. For cryptographic
applications, we are mostly interested in a prime-order subgroup generated by a rational point
$P \in E(\mathbb{F}_{p^n})$. Here, we first give a high-level overview of a typical index-calculus
algorithm for finding an integer $\alpha$ such that $Q = \alpha P$ for $Q \in \langle P \rangle$.


1. Determine a factor base $\mathcal{F} \subset E(\mathbb{F}_{p^n})$.
2. Collect a set $R$ of relations by decomposing random points $a_i P + b_i Q$ into a sum
of points from $\mathcal{F}$, i.e.,

\[ R = \Bigl\{ a_i P + b_i Q = \sum_j P_{ij} : P_{ij} \in \mathcal{F} \Bigr\}. \]

3. When $|R| \approx |\mathcal{F}|$, eliminate the right-hand side using linear algebra to obtain an
equation of the form $aP + bQ = O$, which gives $\alpha = -a/b \bmod \operatorname{ord} P$.
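
The last step can be spelled out as a short worked equation. This is a sketch of the standard argument rather than a formula from the text: the coefficients $c_i$ are hypothetical names for a nontrivial combination, found by linear algebra modulo $\operatorname{ord} P$, that cancels the factor-base side of the relations.

```latex
\[
\sum_i c_i \Bigl(\sum_j P_{ij}\Bigr) = O
\;\Longrightarrow\;
\underbrace{\Bigl(\sum_i c_i a_i\Bigr)}_{a} P
+ \underbrace{\Bigl(\sum_i c_i b_i\Bigr)}_{b} Q = O
\;\Longrightarrow\;
(a + b\alpha) P = O
\;\Longrightarrow\;
\alpha \equiv -a\, b^{-1} \pmod{\operatorname{ord} P}.
\]
```

The final implication requires $b$ to be invertible modulo $\operatorname{ord} P$, which holds for a prime-order subgroup whenever $b \not\equiv 0$.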
The last step of linear algebra is relatively well studied in the literature, so we will
focus on the subproblem in the second step, namely, the point decomposition problem
(PDP) on an elliptic curve, in the rest of this section.


Definition 2.1 (Point Decomposition Problem of mth Order) Given a rational point
$R \in E(\mathbb{F}_{p^n})$ on an elliptic curve $E$ and a factor base $\mathcal{F} \subset E(\mathbb{F}_{p^n})$, find, if they exist,
$P_1, \ldots, P_m \in \mathcal{F}$ such that

\[ R = P_1 + \cdots + P_m. \]


2.1.2.2 Semaev’s Summation Polynomials 


We can solve PDP by considering when the sum of a set of points becomes zero on
an elliptic curve. It is straightforward that if two points sum to zero on an elliptic
curve $E : y^2 = x^3 + ax + b$ in Weierstrass form, then their x-coordinates must be
equal. Let us now consider the simplest yet nontrivial case where three points on $E$
sum to zero. Let

\[ Z = \bigl\{ (x_1, y_1, x_2, y_2, x_3, y_3) \in \mathbb{F}_{p^n}^6 : (x_i, y_i) \in E(\mathbb{F}_{p^n}),\ i = 1, 2, 3;\ (x_1, y_1) + (x_2, y_2) + (x_3, y_3) = O \bigr\}. \]

Clearly, $Z$ is in the variety of the ideal $I \subset \mathbb{F}_{p^n}[X_1, Y_1, X_2, Y_2, X_3, Y_3]$ generated by

\[ Y_i^2 - (X_i^3 + aX_i + b),\ i = 1, 2, 3; \qquad (X_3 - X_1)(Y_2 - Y_1) - (X_2 - X_1)(Y_3 - Y_1). \]

Now let $J = I \cap \mathbb{F}_{p^n}[X_1, X_2, X_3]$. Using MAGMA’s EliminationIdeal function,
we find that $J$ is actually a principal ideal generated by the polynomial
$(X_2 - X_3)(X_1 - X_3)(X_1 - X_2) f_3$, where

\[ f_3 = X_1^2X_2^2 - 2X_1^2X_2X_3 + X_1^2X_3^2 - 2X_1X_2^2X_3 - 2X_1X_2X_3^2 - 2aX_1X_2 - 2aX_1X_3 \]
\[ \qquad - 4bX_1 + X_2^2X_3^2 - 2aX_2X_3 - 4bX_2 - 4bX_3 + a^2. \]

Clearly, the linear factors of this generator correspond to the degenerate cases where
two or more points are the same or of opposite signs, and $f_3$ is the 3rd summation
polynomial, that is, the summation polynomial for three distinct points summing to
zero.
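
The defining property of the 3rd summation polynomial can be checked numerically on a toy curve. The sketch below is illustrative only: the curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$ and the helper names are assumptions, and it uses the fact that three distinct affine points on a Weierstrass curve sum to $O$ exactly when they are collinear. It also confirms that the term-by-term expression agrees with the widely used compact form of $f_3$.

```python
from itertools import combinations

# Toy parameters (assumptions for illustration only).
p, a, b = 97, 2, 3   # E: y^2 = x^3 + 2x + 3 over F_97

def f3_compact(x1, x2, x3):
    """Compact form: (x1-x2)^2 x3^2 - 2((x1+x2)(x1 x2+a)+2b) x3 + (x1 x2-a)^2 - 4b(x1+x2)."""
    return ((x1 - x2) ** 2 * x3 ** 2
            - 2 * ((x1 + x2) * (x1 * x2 + a) + 2 * b) * x3
            + (x1 * x2 - a) ** 2 - 4 * b * (x1 + x2)) % p

def f3_expanded(x1, x2, x3):
    """The same polynomial written out term by term."""
    return (x1**2 * x2**2 - 2 * x1**2 * x2 * x3 + x1**2 * x3**2
            - 2 * x1 * x2**2 * x3 - 2 * x1 * x2 * x3**2
            - 2 * a * x1 * x2 - 2 * a * x1 * x3 - 4 * b * x1
            + x2**2 * x3**2 - 2 * a * x2 * x3 - 4 * b * x2 - 4 * b * x3
            + a**2) % p

# All affine points of E over F_p, found by exhaustive search.
points = [(x, y) for x in range(p) for y in range(p)
          if (y * y - (x**3 + a * x + b)) % p == 0]

def collinear(P, Q, R):
    """True if the three points lie on one line (2x2 determinant test)."""
    return ((Q[0] - P[0]) * (R[1] - P[1]) - (R[0] - P[0]) * (Q[1] - P[1])) % p == 0

# Triples of distinct affine points that sum to O.
zero_sum_triples = [t for t in combinations(points, 3) if collinear(*t)]
all_vanish = all(f3_compact(P[0], Q[0], R[0]) == 0 for P, Q, R in zero_sum_triples)
forms_agree = all(f3_compact(x, y, z) == f3_expanded(x, y, z)
                  for x in range(0, p, 7) for y in range(0, p, 5) for z in range(0, p, 3))
```

Only x-coordinates enter $f_3$, which is what makes it possible to search for decompositions without knowing the y-coordinates in advance.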

Starting from the 3rd summation polynomial, we can recursively construct the
subsequent summation polynomials $f_m$ for $m > 3$ by taking resultants. As a result,
the degree of each variable in $f_m$ is $2^{m-2}$, which grows exponentially in $m$. This is
the observation Semaev made in his seminal work [35]. In short, his proposal is to
consider factor bases of the following form:

\[ \mathcal{F} = \{ (x, y) \in E(\mathbb{F}_{p^n}) : x \in V \subset \mathbb{F}_{p^n} \}, \]

where $V$ is a subset of $\mathbb{F}_{p^n}$. Then, we solve PDP of mth order by solving the
corresponding (m + 1)th summation polynomial $f_{m+1}(X_1, \ldots, X_m, x) = 0$, where $x$ is
the x-coordinate of the point to be decomposed.

Note that this factor base is naturally invariant under point negation. That is,
$P_i \in \mathcal{F}$ implies $-P_i \in \mathcal{F}$. In this case, we have about $|\mathcal{F}|/2$ (trivial) relations
$P_i + (-P_i) = O$ for free, so we only need to find the other $|\mathcal{F}|/2$ nontrivial relations. In
general, we will only discuss factor bases that are invariant under point negation, so
by abuse of language, both $\mathcal{F}$ and $\mathcal{F}$ modulo point negation may be referred to as a
factor base in the rest of this section.


2.1.2.3 Weil Restriction 


Restricting the x-coordinates of the points in a factor base to a subset of $\mathbb{F}_{p^n}$ is
important from the viewpoint of polynomial system solving. Take $f_3$ as an example.
When decomposing a random point $aP + bQ$, we first substitute its x-coordinate
into, say, $X_3$, projecting the ideal onto $\mathbb{F}_{p^n}[X_1, X_2]$. The dimension of the variety of
this ideal is nonzero. Therefore, we would like to impose some restrictions on $X_1$ and
$X_2$ to reduce the dimension to zero so that the solving time becomes more manageable.

When looking for solutions in $\mathbb{F}_{p^n}$ to a polynomial $f = \sum_i a_i X^i \in \mathbb{F}_{p^n}[X]$,
we can view $\mathbb{F}_{p^n}[X]$ as a commutative affine algebra

\[ \mathcal{A} = \mathbb{F}_{p^n}[X]/(X^{p^n} - X) \cong \mathbb{F}_p[X_1, \ldots, X_n]/(X_1^p - X_1, \ldots, X_n^p - X_n). \]

This can be done by identifying the indeterminate $X$ as $X_1\theta_1 + \cdots + X_n\theta_n$, where
$(\theta_1, \ldots, \theta_n)$ is a basis for $\mathbb{F}_{p^n}$ over $\mathbb{F}_p$.
Hence, $f$ can be identified as a polynomial $f_1\theta_1 + \cdots + f_n\theta_n$, where $f_1, \ldots, f_n \in
\mathcal{A} = \mathbb{F}_p[X_1, \ldots, X_n]/(X_1^p - X_1, \ldots, X_n^p - X_n)$, by appropriately sending each
coefficient $a_i \in \mathbb{F}_{p^n}$ to $a_i^{(1)}\theta_1 + \cdots + a_i^{(n)}\theta_n$ for $a_i^{(1)}, \ldots, a_i^{(n)} \in \mathbb{F}_p$. Therefore, an
equation $f = 0$ over $\mathbb{F}_{p^n}$ will give rise to a system of equations $f_1 = \cdots = f_n = 0$
over $\mathbb{F}_p$. This technique is known as the Weil restriction and is used in the Gaudry–Diem
attack, where the factor base is chosen to consist of points whose x-coordinates
lie in a subspace $V$ of $\mathbb{F}_{p^n}$ over $\mathbb{F}_p$ [25, 30].
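
As a small worked illustration of the Weil restriction (an assumed toy case, not taken from the text), let $p$ be odd, take $n = 2$ with basis $(\theta_1, \theta_2) = (1, \theta)$, where $\theta^2 = r$ for a non-square $r \in \mathbb{F}_p$, and restrict $f = X^2 - c$ with $c = c_1 + c_2\theta$:

```latex
\[
(X_1 + X_2\theta)^2 - (c_1 + c_2\theta)
= \underbrace{\bigl(X_1^2 + rX_2^2 - c_1\bigr)}_{f_1}
+ \underbrace{\bigl(2X_1X_2 - c_2\bigr)}_{f_2}\,\theta .
\]
```

The single equation $f = 0$ over $\mathbb{F}_{p^2}$ thus becomes the system $f_1 = f_2 = 0$ in two unknowns over $\mathbb{F}_p$, which is the shape of input handed to a polynomial system solver.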


2.1.2.4 Exploiting Symmetry 


Naturally, the symmetric group $S_m$ acts on a point decomposition $P_1 + \cdots + P_m$
because elliptic curve groups are abelian. As noted by Gaudry in his seminal
work [30], we can therefore rewrite the variables $x_1, \ldots, x_m \in \mathbb{F}_{p^n}$ by elementary
symmetric polynomials $e_1, \ldots, e_m$, where $e_1 = \sum_i x_i$, $e_2 = \sum_{i<j} x_i x_j$,
$e_3 = \sum_{i<j<k} x_i x_j x_k$, etc. Such rewriting can reduce the degree of summation
polynomials and significantly speed up point decomposition [27, 31].
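
The effect of this rewriting can be seen on the smallest case. The following is an illustrative calculation, not a result quoted from the text: for a second-order decomposition on a Weierstrass curve $y^2 = x^3 + ax + b$, with $X_3$ fixed to the x-coordinate of the target point, set $e_1 = X_1 + X_2$ and $e_2 = X_1X_2$ and use $(X_1 - X_2)^2 = e_1^2 - 4e_2$ in the standard compact form of the 3rd summation polynomial:

```latex
\[
f_3 = (X_1 - X_2)^2 X_3^2 - 2\bigl((X_1 + X_2)(X_1X_2 + a) + 2b\bigr)X_3 + (X_1X_2 - a)^2 - 4b(X_1 + X_2)
\]
\[
\phantom{f_3} = (e_1^2 - 4e_2)\,X_3^2 - 2\bigl(e_1(e_2 + a) + 2b\bigr)X_3 + (e_2 - a)^2 - 4b\,e_1 .
\]
```

The total degree in the unknowns drops from 4 in $(X_1, X_2)$ to 2 in $(e_1, e_2)$.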

We might be able to exploit additional symmetry brought by actions of other
groups, e.g., when the factor base is invariant under addition of small torsion points.
For example, consider a decomposition of a point $R$ under the action of addition of
a 2-torsion point $T_2$:

\[ R = P_1 + \cdots + P_n = (P_1 + u_1T_2) + \cdots + (P_{n-1} + u_{n-1}T_2) + \Bigl(P_n + \Bigl(\sum_{i=1}^{n-1} u_i\Bigr)T_2\Bigr). \]

Clearly, this holds for any $u_1, \ldots, u_{n-1} \in \{0, 1\}$, so a decomposition can give rise to
$2^{n-1} - 1$ other decompositions. Similar to rewriting using the elementary symmetric
polynomials for the action of $S_m$, we can also take advantage of this additional
symmetry by appropriate rewriting [26].

Naturally, such speedup is curve-specific. Furthermore, even if the factor base is 
invariant under additional group actions, we may or may not be able to exploit such 
symmetry to speed up the point decomposition depending on whether the action is 
“easy to handle in the polynomial system solving process” [26]. 


2.1.2.5 PDP on (Twisted) Edwards Curves 


Faugère, Gaudry, Huot, and Renault studied PDP on twisted Edwards, twisted Jacobi
intersections, and Weierstrass curves [26]. For the sake of completeness, we include
some of their results here. An Edwards curve over $\mathbb{F}_{p^n}$ for $p \neq 2$ is defined by the
equation $x^2 + y^2 = 1 + dx^2y^2$ for certain $d \in \mathbb{F}_{p^n}$ [24]. A twisted Edwards curve
$tE_{a,d}$ over $\mathbb{F}_{p^n}$ for $p \neq 2$ is defined by the equation $ax^2 + y^2 = 1 + dx^2y^2$ for certain
$a, d \in \mathbb{F}_{p^n}$ [23]. A twisted Edwards curve is a quadratic twist of an Edwards curve by
$a_0 = 1/(a - d)$. For $P = (x, y) \in tE_{a,d}$, $-P = (-x, y)$. Furthermore, the addition
and doubling formulae for $(x_3, y_3) = (x_1, y_1) + (x_2, y_2)$ are given as follows.

When $(x_1, y_1) \neq (x_2, y_2)$:

\[ x_3 = \frac{x_1y_2 + y_1x_2}{1 + dx_1x_2y_1y_2}, \qquad y_3 = \frac{y_1y_2 - ax_1x_2}{1 - dx_1x_2y_1y_2}. \]

When $(x_1, y_1) = (x_2, y_2)$:

\[ x_3 = \frac{2x_1y_1}{1 + dx_1^2y_1^2}, \qquad y_3 = \frac{y_1^2 - ax_1^2}{1 - dx_1^2y_1^2}. \]

The 3rd summation polynomial for twisted Edwards curves is [26]:

\[ f_{tE,3}(Y_1, Y_2, Y_3) = \Bigl(Y_1^2Y_2^2 - Y_1^2 - Y_2^2 + \frac{a}{d}\Bigr)Y_3^2 + 2\,\frac{d - a}{d}\,Y_1Y_2Y_3 + \frac{a}{d}\bigl(Y_1^2 + Y_2^2 - 1\bigr) - Y_1^2Y_2^2. \]

Again, the subsequent summation polynomials are obtained by taking resultants.
2.1.2.6 Symmetry and Decomposition Probability


Symmetry brought by group action on point decomposition will inevitably be
accompanied by a decrease in decomposition probability. For example, if a factor base $\mathcal{F}$
is invariant under addition of a 2-torsion point, then the decomposition probability
for PDP of the mth order should decrease by a factor of $2^{m-1}$. This is for the same
reason that the decomposition probability decreases by a factor of $m!$ because the
symmetric group $S_m$ acts on $\mathcal{F}$.

However, this simple fact seems to have been largely ignored in the literature. For
example, Faugère, Gaudry, Huot, and Renault explicitly stated in Sect. 5.3 of their
study that “[the] probability to decompose a point [into a sum of n points from the
factor base] is $\frac{1}{n!}$” for twisted Edwards or twisted Jacobi intersections curves, despite
the fact that the factor base is invariant under the addition of 2-torsion points [26].
At first glance, this may not seem a problem, as we would expect to obtain $2^{n-1}$
solutions if we can successfully solve a PDP instance. (Unfortunately, this is also not
true in general. We will return to it in more detail in Sect. 2.1.5.3.) However, when
estimating the cost of a complete ECDLP attack, they proposed to collapse these
$2^{n-1}$ relations into one to reduce the size of the factor base and thus the cost of the
linear algebra, cf. Remark 5 of the paper. In this case, the decrease in decomposition
probability does have an adverse effect, and their estimate for the overall ECDLP
cost ended up being overoptimistic by a factor of at least $2^{n-1}$.


2.1.3 Montgomery and Hessian Curves 


2.1.3.1 Montgomery Curves 
A Montgomery curve $M_{A,B}$ over $\mathbb{F}_{p^n}$ for $p \neq 2$ is defined by the equation

\[ By^2 = x^3 + Ax^2 + x \tag{2.1} \]

for $A, B \in \mathbb{F}_{p^n}$ such that $A \neq \pm 2$, $B \neq 0$, and $B(A^2 - 4) \neq 0$ [32]. For
$P = (x, y) \in M_{A,B}$, $-P = (x, -y)$. Furthermore, the addition and doubling formulae
for $(x_3, y_3) = (x_1, y_1) + (x_2, y_2)$ are given as follows. When $(x_1, y_1) \neq (x_2, y_2)$:

\[ x_3 = B\Bigl(\frac{y_2 - y_1}{x_2 - x_1}\Bigr)^2 - A - x_1 - x_2 = \frac{B(x_2y_1 - x_1y_2)^2}{x_1x_2(x_2 - x_1)^2}, \]
\[ y_3 = \frac{(2x_1 + x_2 + A)(y_2 - y_1)}{x_2 - x_1} - \frac{B(y_2 - y_1)^3}{(x_2 - x_1)^3} - y_1. \]

When $(x_1, y_1) = (x_2, y_2)$:

\[ x_3 = \frac{(x_1^2 - 1)^2}{4x_1(x_1^2 + Ax_1 + 1)}, \]
\[ y_3 = \frac{(2x_1 + x_1 + A)(3x_1^2 + 2Ax_1 + 1)}{2By_1} - \frac{B(3x_1^2 + 2Ax_1 + 1)^3}{(2By_1)^3} - y_1. \]

It was noted by Montgomery himself in his original paper that such curves can give
rise to efficient scalar multiplication algorithms [32]. That is, consider a random
point $P \in M_{A,B}(\mathbb{F}_{p^n})$ and $nP = (X_n : Y_n : Z_n)$ in projective coordinates for some
integer $n$. Then

\[ X_{m+n} = Z_{m-n}\bigl[(X_m - Z_m)(X_n + Z_n) + (X_m + Z_m)(X_n - Z_n)\bigr]^2, \]
\[ Z_{m+n} = X_{m-n}\bigl[(X_m - Z_m)(X_n + Z_n) - (X_m + Z_m)(X_n - Z_n)\bigr]^2. \]

In particular, when $m = n$,

\[ X_{2n} = (X_n + Z_n)^2(X_n - Z_n)^2, \]
\[ Z_{2n} = (4X_nZ_n)\bigl((X_n - Z_n)^2 + ((A + 2)/4)(4X_nZ_n)\bigr), \]
\[ 4X_nZ_n = (X_n + Z_n)^2 - (X_n - Z_n)^2. \]

In this way, scalar multiplication on the Montgomery curve can be performed without
using y-coordinates, leading to fast implementation.
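
The x-only formulas above translate directly into a Montgomery ladder. The sketch below is a minimal illustration over a small prime field, not an optimized or constant-time implementation: the parameters $p = 101$, $A = 3$, $B = 1$ and all function names are assumptions, and the affine addition is included only as an independent cross-check. It requires Python 3.8+ for `pow(x, -1, p)`.

```python
# x-only arithmetic on a Montgomery curve B*y^2 = x^3 + A*x^2 + x over F_p.
# The prime p, the coefficients A and B, and the helper names are illustrative
# assumptions chosen for this sketch.
p, A, B = 101, 3, 1
A24 = (A + 2) * pow(4, -1, p) % p          # (A + 2)/4 mod p

def xdbl(X, Z):
    """(X_2n : Z_2n) from (X_n : Z_n)."""
    s = (X + Z) ** 2 % p
    d = (X - Z) ** 2 % p
    c = (s - d) % p                        # 4 * X_n * Z_n
    return s * d % p, c * (d + A24 * c) % p

def xadd(Xm, Zm, Xn, Zn, Xd, Zd):
    """(X_{m+n} : Z_{m+n}) from mP, nP and the difference (m-n)P = (Xd : Zd)."""
    u = (Xm - Zm) * (Xn + Zn) % p
    v = (Xm + Zm) * (Xn - Zn) % p
    return Zd * (u + v) ** 2 % p, Xd * (u - v) ** 2 % p

def ladder(k, x):
    """x-coordinate of k*P for k >= 1, given x = x(P) != 0; None if k*P = O."""
    R0, R1 = (x, 1), xdbl(x, 1)            # invariant: R1 - R0 = P
    for bit in bin(k)[3:]:                 # bits of k below the leading 1
        if bit == "1":
            R0, R1 = xadd(*R0, *R1, x, 1), xdbl(*R1)
        else:
            R0, R1 = xdbl(*R0), xadd(*R0, *R1, x, 1)
    X, Z = R0
    return None if Z == 0 else X * pow(Z, -1, p) % p

def affine_add(P, Q):
    """Reference affine addition (None encodes O), used only as a cross-check."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + 2 * A * x1 + 1) * pow(2 * B * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (B * lam * lam - A - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

# Pick some affine point with x != 0 and y != 0 by exhaustive search.
P = next((x, y) for x in range(1, p) for y in range(1, p)
         if (B * y * y - (x**3 + A * x * x + x)) % p == 0)
```

Each ladder step performs one differential addition and one doubling regardless of the bit value, which is the property that makes this style of scalar multiplication attractive in practice.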


2.1.3.2 Summation Polynomials for Montgomery Curves 


Following Semaev’s approach [35], we can construct summation polynomials for
Montgomery curves. As with Weierstrass curves, the 2nd summation polynomial for
Montgomery curves is simply $f_{M,2} = X_1 - X_2$. Now, we consider $P, Q \in M_{A,B}$
for $P = (x_1, y_1)$ and $Q = (x_2, y_2)$. Let $P + Q = (x_3, y_3)$ and $P - Q = (x_4, y_4)$.
By the addition formula, we have

\[ x_3 = \frac{B(x_2y_1 - x_1y_2)^2}{x_1x_2(x_2 - x_1)^2}, \qquad x_4 = \frac{B(x_2y_1 + x_1y_2)^2}{x_1x_2(x_2 - x_1)^2}. \]

It follows that

\[ x_3 + x_4 = \frac{2\bigl((x_1 + x_2)(x_1x_2 + 1) + 2Ax_1x_2\bigr)}{(x_1 - x_2)^2}, \qquad x_3x_4 = \frac{(1 - x_1x_2)^2}{(x_1 - x_2)^2}. \]

Using the relationship between the roots of a quadratic polynomial and its
coefficients, we obtain a quadratic polynomial in $x$ whose roots are $x_3$ and $x_4$:

\[ (x_1 - x_2)^2x^2 - 2\bigl((x_1 + x_2)(x_1x_2 + 1) + 2Ax_1x_2\bigr)x + (1 - x_1x_2)^2. \]

From here, we obtain the 3rd summation polynomial for Montgomery curves:

\[ f_{M,3}(X_1, X_2, X_3) = (X_1 - X_2)^2X_3^2 - 2\bigl((X_1 + X_2)(X_1X_2 + 1) + 2AX_1X_2\bigr)X_3 + (1 - X_1X_2)^2, \]

as well as the subsequent summation polynomials by taking resultants:

\[ f_{M,m}(X_1, \ldots, X_m) = \mathrm{Res}_X\bigl(f_{M,m-k}(X_1, \ldots, X_{m-k-1}, X),\ f_{M,k+2}(X_{m-k}, \ldots, X_m, X)\bigr). \]

2.1.3.3 Small Torsion Points on Montgomery Curves 


A Montgomery curve always contains an affine 2-torsion point $T_2$. Because
$T_2 + T_2 = 2T_2 = O$, $-T_2 = T_2$. If we write $T_2 = (x, y)$, then we can see that $y = 0$ in
order for $-T_2 = T_2$, as $p \neq 2$. Substituting $y = 0$ into Eq. (2.1), we get the equation
$x^3 + Ax^2 + x = 0$. The left-hand side factors into $x(x^2 + Ax + 1)$, so we get

\[ x = 0 \quad \text{or} \quad x = \frac{-A \pm \sqrt{A^2 - 4}}{2}. \]

Therefore, the set of rational points over the definition field $\mathbb{F}_{p^n}$ of a Montgomery
curve includes at least two 2-torsion points, namely $O$ and $(0, 0)$. The other 2-torsion
points may or may not be rational, so we will focus on $(0, 0)$ in this section.
Substituting $(x_2, y_2) = (0, 0)$ into the addition formula for Montgomery curves, we get
that for any point $P = (x, y) \in M_{A,B}$, $P + (0, 0) = (1/x, -y/x^2)$.
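
The substitution can be spelled out as follows. This is a routine verification added for illustration; it assumes $x \neq 0$, introduces the chord slope $\lambda$ as a new symbol, uses Eq. (2.1) to replace $By^2$, and writes the y-coordinate in the equivalent form $y_3 = \lambda(x_1 - x_3) - y_1$:

```latex
\[
\lambda = \frac{0 - y}{0 - x} = \frac{y}{x}, \qquad
x_3 = B\lambda^2 - A - x - 0 = \frac{By^2 - Ax^2 - x^3}{x^2} = \frac{x}{x^2} = \frac{1}{x},
\]
\[
y_3 = \lambda\,(x - x_3) - y = \frac{y}{x}\Bigl(x - \frac{1}{x}\Bigr) - y = -\frac{y}{x^2}.
\]
```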

To be able to exploit the symmetry of addition of $T_2 = (0, 0)$, we need to choose
the factor base $\mathcal{F} = \{(x, y) \in E(\mathbb{F}_{p^n}) : x \in V \subset \mathbb{F}_{p^n}\}$ invariant under addition of $T_2$.
This means that $V$ needs to be closed under taking multiplicative inverses. In other
words, $V$ needs to be a subfield of $\mathbb{F}_{p^n}$, i.e., $V = \mathbb{F}_{p^\ell}$ for some integer $\ell$ that divides
$n$. In this case, $f_m$ is invariant under the action $x_i \mapsto 1/x_i$. Unfortunately, such an
action is not linear and hence not easy to handle in polynomial system solving. How
to take advantage of this kind of symmetry in PDP is still an open research problem.


2.1.3.4 Hessian Curves 


A Hessian curve $H_d$ over $\mathbb{F}_{p^n}$ for $p^n \equiv 2 \bmod 3$ is defined by the equation

\[ x^3 + y^3 + 1 = 3dxy \tag{2.2} \]

for $d \in \mathbb{F}_{p^n}$ such that $27d^3 \neq 1$ [36]. For $P = (x, y) \in H_d$, $-P = (y, x)$.
Furthermore, the addition and doubling formulae for $(x_3, y_3) = (x_1, y_1) + (x_2, y_2)$ are given
as follows.

When $(x_1, y_1) \neq (x_2, y_2)$:

\[ x_3 = \frac{y_1^2x_2 - y_2^2x_1}{x_2y_2 - x_1y_1}, \qquad y_3 = \frac{x_1^2y_2 - x_2^2y_1}{x_2y_2 - x_1y_1}. \]

When $(x_1, y_1) = (x_2, y_2)$:

\[ x_3 = \frac{y_1(1 - x_1^3)}{x_1^3 - y_1^3}, \qquad y_3 = \frac{x_1(y_1^3 - 1)}{x_1^3 - y_1^3}. \]

2.1.3.5 Summation Polynomials for Hessian Curves 


Following an approach similar to that outlined by Galbraith and Gebregiyorgis [29], we can
construct summation polynomials for Hessian curves. First, we introduce a new
variable $T = X + Y$, which is invariant under point negation. The 2nd summation
polynomial for Hessian curves is simply $f_{H,2} = T_1 - T_2$. Now let

\[ Z = \bigl\{ (x_1, y_1, t_1, x_2, y_2, t_2, x_3, y_3, t_3) \in \mathbb{F}_{p^n}^9 : (x_i, y_i) \in H_d(\mathbb{F}_{p^n}),\ i = 1, 2, 3; \]
\[ \qquad (x_1, y_1) + (x_2, y_2) + (x_3, y_3) = O;\ x_i + y_i = t_i,\ i = 1, 2, 3 \bigr\}. \]

Clearly, $Z$ is in the variety of the ideal $I \subset \mathbb{F}_{p^n}[X_1, Y_1, T_1, X_2, Y_2, T_2, X_3, Y_3, T_3]$
generated by

\[ X_i^3 + Y_i^3 + 1 - 3dX_iY_i,\ i = 1, 2, 3; \]
\[ (X_3 - X_1)(Y_2 - Y_1) - (X_2 - X_1)(Y_3 - Y_1); \]
\[ X_i + Y_i - T_i,\ i = 1, 2, 3. \]

Again, we compute the elimination ideal $I \cap \mathbb{F}_{p^n}[T_1, T_2, T_3]$ and obtain a principal
ideal generated by some polynomial. After removing the degenerate factors, we
obtain the 3rd summation polynomial for Hessian curves:

\[ f_{H,3}(T_1, T_2, T_3) = T_1^2T_2^2T_3 + dT_1^2T_2^2 + T_1^2T_2T_3^2 + dT_1^2T_2T_3 + dT_1^2T_3^2 - T_1^2 \]
\[ \qquad + T_1T_2^2T_3^2 + dT_1T_2^2T_3 + dT_1T_2T_3^2 + 3d^2T_1T_2T_3 + 2T_1T_2 + 2T_1T_3 \]
\[ \qquad + 2dT_1 + dT_2^2T_3^2 - T_2^2 + 2T_2T_3 + 2dT_2 - T_3^2 + 2dT_3 + 3d^2, \]

as well as the subsequent summation polynomials by taking resultants:

\[ f_{H,m}(T_1, \ldots, T_m) = \mathrm{Res}_T\bigl(f_{H,m-k}(T_1, \ldots, T_{m-k-1}, T),\ f_{H,k+2}(T_{m-k}, \ldots, T_m, T)\bigr). \]
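
The Hessian summation polynomial can be sanity-checked in the same spirit. The sketch below is illustrative only: the parameters $p = 11$ (which satisfies $p \equiv 2 \bmod 3$) and $d = 2$, as well as the helper names, are assumptions. It uses the fact that three distinct affine points of the curve sum to $O$ exactly when they are collinear, and evaluates $f_{H,3}$ at $t_i = x_i + y_i$.

```python
from itertools import combinations

# Toy Hessian curve x^3 + y^3 + 1 = 3*d*x*y over F_p with p = 2 mod 3
# (p and d are assumptions for illustration).
p, d = 11, 2

def f_H3(t1, t2, t3):
    """3rd summation polynomial for Hessian curves in t = x + y, reduced mod p."""
    return (t1**2 * t2**2 * t3 + d * t1**2 * t2**2 + t1**2 * t2 * t3**2
            + d * t1**2 * t2 * t3 + d * t1**2 * t3**2 - t1**2
            + t1 * t2**2 * t3**2 + d * t1 * t2**2 * t3 + d * t1 * t2 * t3**2
            + 3 * d**2 * t1 * t2 * t3 + 2 * t1 * t2 + 2 * t1 * t3 + 2 * d * t1
            + d * t2**2 * t3**2 - t2**2 + 2 * t2 * t3 + 2 * d * t2
            - t3**2 + 2 * d * t3 + 3 * d**2) % p

# All affine points of the curve over F_p, found by exhaustive search.
points = [(x, y) for x in range(p) for y in range(p)
          if (x**3 + y**3 + 1 - 3 * d * x * y) % p == 0]

def collinear(P, Q, R):
    """True if the three points lie on one line (2x2 determinant test)."""
    return ((Q[0] - P[0]) * (R[1] - P[1]) - (R[0] - P[0]) * (Q[1] - P[1])) % p == 0

# Triples of distinct affine points that sum to O.
zero_sum_triples = [t for t in combinations(points, 3) if collinear(*t)]
all_vanish = all(f_H3(sum(P), sum(Q), sum(R)) == 0 for P, Q, R in zero_sum_triples)
```

On this toy curve the points $(3, 1)$, $(3, 2)$, and $(3, 8)$ lie on the line $x = 3$ and therefore form one of the zero-sum triples.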


2.1.3.6 Small Torsion Points on Hessian Curves 


As we shall see in Sect. 2.1.4.1, we will compare elliptic curves in various forms
that are isomorphic to one another over the same definition field. As a result, we
will only experiment with those Hessian curves that include 2-torsion points, like
Montgomery or (twisted) Edwards curves. Because $T_2 + T_2 = 2T_2 = O$, it follows
that $-T_2 = T_2$. If we write $T_2 = (x, y)$, then we can see that $x = y$ in order for
$-T_2 = T_2$, as $-T_2 = (y, x)$. Substituting $x = y$ into Eq. (2.2), we get the equation
$2x^3 - 3dx^2 + 1 = 0$. Therefore, a Hessian curve $H_d(\mathbb{F}_{p^n})$ has a 2-torsion point $(\zeta, \zeta)$
if the polynomial $2X^3 - 3dX^2 + 1$ has a root $\zeta$ in $\mathbb{F}_{p^n}$. In this case, the addition of
this 2-torsion point to a point $(x, y)$ would give a point $(x', y')$, where

\[ x' = \frac{\zeta y^2 - \zeta^2x}{\zeta^2 - xy}, \qquad y' = \frac{\zeta x^2 - \zeta^2y}{\zeta^2 - xy}. \]

Obviously, the typical factor bases are not invariant under addition of this 2-torsion
point in general.

A Hessian curve always contains a 3-torsion point $T_3$ such that $3T_3 = O$ [36]. If
we let $T_3 = (x, y)$, then we see that $2(x, y) = -(x, y) = (y, x)$, substituting which
into the doubling formula, we get

\[ (y, x) = \Bigl(\frac{y(1 - x^3)}{x^3 - y^3}, \frac{x(y^3 - 1)}{x^3 - y^3}\Bigr). \]

Because $x$ and $y$ cannot be zero at the same time, we have $x^3 - y^3 = 1 - x^3 =
y^3 - 1$, or $x^3 = y^3 = 1$. Now because $p^n \equiv 2 \bmod 3$, $\mathbb{F}_{p^n}$ does not have any primitive
cubic roots of unity, so $x = y = 1$ and $T_3 = (1, 1)$. By the addition formula, if
$P = (x, y)$, then

\[ P + T_3 = (x, y) + (1, 1) = \Bigl(\frac{y^2 - x}{1 - xy}, \frac{x^2 - y}{1 - xy}\Bigr). \]


However, for $P \in \mathcal{F}$, we only know that $t = x + y \in V \subset \mathbb{F}_{p^n}$, but we know nothing
about $1 - xy$, which can lie outside of $V$. Therefore, again, typical factor bases are
not invariant under addition of this 3-torsion point in general. Therefore, it is not
clear how to exploit such symmetry brought by addition of small torsion points for
Hessian curves.

Fig. 2.1 Experimental results on PDP solving for the case of n = 5

m | p   | Curve       | Time  | Dreg | Matcost     | Rank
3 | 239 | Hessian     | 0     | 6    | 12336 8     | 1
3 | 239 | Weierstrass | 0     | 6    | 41259.0     | 1
3 | 239 | Montgomery  | 0     | 6    | 61239.0     | 4
3 | 239 | tEdwards    | 0     | 6    | 6308.4      | 4
3 | 251 | Hessian     | 0     | 6    | 41420.4     | 1
3 | 251 | Weierstrass | 0     | 6    | 42132.0     | 1
3 | 251 | Montgomery  | 0     | 6    | 61127.9     | 4
3 | 251 | tEdwards    | 0     | 6    | 6308.4      | 4
4 | 239 | Hessian     | 3.990 | 19   | 12066100000 | 1
4 | 239 | Weierstrass | 3.680 | 19   | 12064700000 | 1
4 | 239 | Montgomery  | 3.489 | 18   | 11399100000 | 5
4 | 239 | tEdwards    | 0.150 | 18   | 54093000    | 5
4 | 251 | Hessian     | 3.459 | 19   | 12069800000 | 1
4 | 251 | Weierstrass | 3.659 | 19   | 12066400000 | 1
4 | 251 | Montgomery  | 3.280 | 18   | 11401700000 | 5
4 | 251 | tEdwards    | 0.119 | 18   | 54102900    | 5


2.1.4 Experiments on PDP Solving 


This section shows the results of our experiments conducted to compare the computational
complexity of PDP on four different curves: Hessian (H), Weierstrass (W),
Montgomery (M), and twisted Edwards (tE).


2.1.4.1 Experimental Setup 


As explained in Sect. 2.1.2.1, we focus on PDP in these experiments as the linear
algebra step is already well understood. Furthermore, we focus on the bottleneck
computation in PDP, namely, the cost of the F4 algorithm for computing Gröbner
bases of the polynomial systems obtained after rewriting using the elementary
symmetric polynomials and applying the Weil restriction technique to summation
polynomials. This way we take advantage of the symmetry of $S_m$ acting
on point decompositions. However, we did not exploit the symmetry of any other group
actions. This is because we want to compare the intrinsic computational complexity
of PDP and hence only consider the symmetry that is present in all curves. Exploiting
further curve-specific symmetry whenever possible will result in a further speedup,
but it would be independent of our findings here.


2.1.4.2 Experimental Results


Figure 2.1 presents our experimental results for the case of n = 5. Here, we choose
our factor base by taking $V$ as the base field $\mathbb{F}_p$ of $\mathbb{F}_{p^n}$. All our experiments were
performed using the MAGMA computational algebra system (version 2.23-1) on a
single core of an Intel Xeon CPU E7-4830 v4 running at 2 GHz. For each PDP instance,
we compare the running time (in seconds), Dreg, Matcost, and Rank. The “Dreg” is the
maximum step degree reached during the execution of the F4 algorithm, which is
referred to as the “degree of regularity” in the literature [29] and provides an upper
bound for the sizes of the Macaulay submatrices involved in the computation. The
“Matcost” is a number output by the MAGMA implementation of the F4 algorithm and
provides an estimate of the linear algebra cost during the execution of the F4
algorithm. Finally, the “Rank” is the number of linearly independent relations we
obtain once a PDP instance is successfully solved. It is an important factor to
consider, as it determines how many PDP instances we need to solve successfully to
have enough relations for a complete ECDLP attack using index calculus.

We can clearly see that the PDP solving time and Matcost for twisted Edwards curves
are much smaller than those for the other curves. In contrast, the degrees of
regularity for Montgomery and twisted Edwards curves are smaller than those of the
other curves in the case of m = 4. In addition, we can see that the rank for Hessian
and Weierstrass curves is 1 in all cases, whereas for Montgomery and twisted Edwards
curves, it is 4 and 5 in the case of m = 3 and m = 4, respectively. Last but not
least, although we only present the results for small p (around 8 bits long) here, we
also have some preliminary results for larger p (around 16 and 32 bits long). Apart
from the slight difference in the absolute running time, all other results such as
Dreg, Matcost, and Rank are similar, so we do not repeat them here.


2.1.5 Analysis 


2.1.5.1 Revisit Summation Polynomial in Each Form 


As we have seen in Sect. 2.1.4.2, PDP on (twisted) Edwards curves seems easier to
solve than on other curves. The explanation offered by Faugère, Gaudry, Huot, and
Renault is “due to the smaller degree appearing in the computation of Gröbner basis
of p, in comparison with the Weierstrass case,” cf. Sect. 4.1.1 of their paper [26].
Unfortunately, this cannot explain the difference between (twisted) Edwards and
Montgomery curves as the highest degrees appearing in the computation of Gröbner
bases are the same for these two curves. Therefore, there must be other reasons. We
have found that the total number of terms for twisted Edwards curves is significantly
lower than that for the other curves in all cases. Naturally, this could lead to faster
solving time with the F4 algorithm. We also note that, except for the twisted Edwards
curves, the summation polynomials before Weil restriction for the other curves are
all 100% dense without any missing terms.


2.1.5.2 Missing Terms of Summation Polynomials in (Twisted) Edwards Curves


In this section, we will show that the summation polynomials for (twisted) Edwards 
curves mainly have terms of even degrees. The set of terms of even degrees is closed 
under multiplication, so intuitively, such polynomials are easier to solve, which can 
be the main reason for the efficiency gain observed in the case of (twisted) Edwards 
curves. 

We shall make this intuition precise in Theorem 2.1, but before we state the main
result, we need to clarify our terminology for ease of exposition. When a multivariate
polynomial is regarded as a univariate polynomial in one of its variables $T$, we say
that the coefficient $a_i$ of a term $a_i T^i$ is an even- or odd-degree coefficient depending
on whether $i$ is even or odd, respectively. Note that these coefficients are themselves
multivariate polynomials in one fewer variable.

We say that a monomial $m = \prod_{i=1}^{n} x_i^{e_i}$, $e_i \ge 0$, in a multivariate polynomial in
$n$ variables is of even degree, or simply an even-degree monomial, if $\sum_i e_i$ is even;
it is of odd degree, or simply an odd-degree monomial, otherwise. In contrast, a
monomial is of (homogeneous) even parity if all $e_i$ are even; it is of (homogeneous)
odd parity if all $e_i$ are odd. A monomial is of homogeneous parity if it is either of
homogeneous even or odd parity. Note that the definition of monomials of odd parity
depends on the total number of variables in the polynomial, which is not the case for
monomials of even parity because we regard 0 as even. For example, the monomial
$x_1 x_2$ is a monomial of odd parity in a polynomial in $x_1$ and $x_2$ but not so in another
polynomial in $x_1, \dots, x_n$ for $n > 2$.
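
The distinction between degree and parity can be made concrete with a small classifier. The following sketch represents a monomial by its exponent tuple $(e_1, \dots, e_n)$; the function name and the returned labels are illustrative only.

```python
def classify_monomial(exponents):
    """Classify a monomial x_1^e_1 ... x_n^e_n given as an exponent tuple.

    Returns a pair (degree class, parity class) following the terminology
    above; the number of variables n is the length of the tuple.
    """
    degree = "even degree" if sum(exponents) % 2 == 0 else "odd degree"
    if all(e % 2 == 0 for e in exponents):
        parity = "even parity"
    elif all(e % 2 == 1 for e in exponents):
        parity = "odd parity"
    else:
        parity = "not of homogeneous parity"
    return degree, parity


# x1*x2 viewed in two variables versus in three variables
print(classify_monomial((1, 1)))
print(classify_monomial((1, 1, 0)))
```

The two calls reproduce the example in the text: the same monomial $x_1 x_2$ is of odd parity in two variables but loses homogeneous parity once a third variable with exponent 0 is present.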

By abuse of language, we say that a polynomial is of even or odd parity if it
is a linear combination of monomials of even or odd parity, respectively; that a
polynomial is of homogeneous parity if it is a linear combination of monomials of
homogeneous parity. The set of polynomials of even parity is closed under polynomial
addition and multiplication and hence forms a subring. In contrast, a polynomial $f$ in
$x_1, \dots, x_n$ of odd parity must have the form $\sum_j c_j \left( \prod_{i=1}^{n} x_i^{e_{ij}} \right)$ for $e_{ij}$ odd. Therefore,
if $f$ is a polynomial of odd parity and $g$ a polynomial of even parity, then $fg$ must
be of odd parity.


Theorem 2.1 Let $\mathcal{E}$ be a family of elliptic curves such that its 3rd summation
polynomial $f_{\mathcal{E},3}(X_1, X_2, X_3)$ is of degree 2 in each variable $X_i$ and of homogeneous
parity. Let $g_{\mathcal{E},m}$ be the polynomial corresponding to the PDP of $m$th order for $\mathcal{E}$
as described in Sect. 2.1.2.2. That is, $g_{\mathcal{E},m}(X_1, \dots, X_m) = f_{\mathcal{E},m+1}(X_1, \dots, X_m, x)$,
where $x$ is a constant depending on the point to be decomposed.


1. If $m$ is even, then $g_{\mathcal{E},m}$ has no monomials of odd degrees.
2. If $m$ is odd, then $g_{\mathcal{E},m}$ has some but not all monomials of odd degrees.

Among the four forms of elliptic curves that we investigated in this section, only the
(twisted) Edwards form satisfies the premises of Theorem 2.1. As we have seen in
Sect. 2.1.4, PDP solving for the (twisted) Edwards form is significantly faster than
for the other forms.

We will prove Theorem 2.1 in the rest of this section, for which we will need the
following lemmas.


Lemma 2.1 Let $f_1(T_1, \dots, T_r, T) = a_0 + a_1 T + \cdots + a_m T^m$ and
$f_2(T_1, \dots, T_r, T) = b_0 + b_1 T + \cdots + b_n T^n$ be two polynomials in $r + 1$ variables, where $a_i$
and $b_i$ are polynomials in $T_1, \dots, T_r$. Let $f(T_1, \dots, T_r) = \mathrm{Res}_T(f_1, f_2)$ be the resultant
of $f_1$ and $f_2$ regarded as two univariate polynomials in $T$. If both $m$ and $n$ are
even, then every monomial of $f$ is a product of an even number or none of the odd-degree
coefficients of $f_1$ and $f_2$ and some or none of the even-degree coefficients
of $f_1$ and $f_2$. Specifically, the odd-degree coefficients $a_{2k+1}$ and $b_{2k+1}$ of $f_1$ and $f_2$,
respectively, appear in total an even number of times in each monomial of $f$.


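Proof The resultant $\mathrm{Res}_T(f_1, f_2)$ of $f_1$ and $f_2$ is the determinant of the following
$(m + n) \times (m + n)$ matrix $S$:

$$
S = \begin{pmatrix}
a_m & a_{m-1} & \cdots & a_0    &         &        &     \\
    & a_m     & a_{m-1} & \cdots & a_0     &        &     \\
    &         & \ddots  &        &         & \ddots &     \\
    &         &         & a_m    & a_{m-1} & \cdots & a_0 \\
b_n & b_{n-1} & \cdots  & b_0    &         &        &     \\
    & b_n     & b_{n-1} & \cdots & b_0     &        &     \\
    &         & \ddots  &        &         & \ddots &     \\
    &         &         & b_n    & b_{n-1} & \cdots & b_0
\end{pmatrix}
\tag{2.3}
$$

where the first $n$ rows contain the coefficients of $f_1$ and the last $m$ rows contain
those of $f_2$. We denote $s_{ij}$ as the entry at the $i$th row and $j$th column of $S$ for
$1 \le i, j \le m + n$. Because both $m$ and $n$ are even, an even-degree coefficient $a_{2k}$ or
$b_{2k}$ will appear in $s_{ij}$ for which the sum of indices $i + j$ is even. Similarly, an
odd-degree coefficient $a_{2k+1}$ or $b_{2k+1}$ will appear in $s_{ij}$ for which the sum of indices
$i + j$ is odd. Now recall that the determinant of $S$ is defined as

$$\sum_{\sigma \in S_{n+m}} \mathrm{sgn}(\sigma) \, s_{1,\sigma(1)} \, s_{2,\sigma(2)} \cdots s_{m+n,\sigma(m+n)}.$$

We note that the sum of the indices of any summand is

$$\sum_{i=1}^{m+n} \bigl( i + \sigma(i) \bigr) = (m + n)(m + n + 1),$$

which is always even. Therefore, the odd-degree coefficients must appear an even
number of times, thus completing the proof.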
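
The counting argument in this proof can be replayed by brute force for small degrees. The sketch below builds the Sylvester matrix symbolically (entries are labels such as `("a", 2)` rather than numbers), expands the determinant over all permutations, and records how many odd-degree coefficients occur in each nonzero summand. The function names are illustrative, and the enumeration is only practical for small $m + n$.

```python
from itertools import permutations


def sylvester(m, n):
    """Symbolic Sylvester matrix of f1 (degree m) and f2 (degree n).

    Entries are ("a", k) or ("b", k) for the coefficient of T^k, or None
    for a structural zero. The first n rows hold a_m..a_0, the last m rows
    hold b_n..b_0, each shifted one column per row.
    """
    size = m + n
    S = [[None] * size for _ in range(size)]
    for i in range(n):
        for k in range(m + 1):
            S[i][i + k] = ("a", m - k)
    for i in range(m):
        for k in range(n + 1):
            S[n + i][i + k] = ("b", n - k)
    return S


def odd_coefficient_counts(m, n):
    """Set of counts of odd-degree coefficients over all nonzero summands."""
    S = sylvester(m, n)
    size = m + n
    counts = set()
    for sigma in permutations(range(size)):
        entries = [S[i][sigma[i]] for i in range(size)]
        if any(e is None for e in entries):
            continue  # summand contains a structural zero
        counts.add(sum(1 for _, k in entries if k % 2 == 1))
    return counts


print(odd_coefficient_counts(2, 2))  # both degrees even
print(odd_coefficient_counts(1, 1))  # both degrees odd, for contrast
```

For even $m$ and $n$ every recorded count is even, as Lemma 2.1 states; the case $m = n = 1$ is included only to show that the conclusion can fail when the premise is dropped.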


Lemma 2.2 Let $\mathcal{E}$ be a family of elliptic curves such that its 3rd summation
polynomial $f_{\mathcal{E},3}(X_1, X_2, X_3)$ is of degree 2 in each variable $X_i$ and of homogeneous
parity. Then, any subsequent summation polynomial $f_{\mathcal{E},m}(X_1, \dots, X_m)$ for $m > 3$
is of homogeneous parity.


Proof As the summation polynomial $f_{\mathcal{E},m+1}$ for $m \ge 3$ is defined recursively from
$f_{\mathcal{E},m}$ and $f_{\mathcal{E},3}$ by taking resultants

$$f_{\mathcal{E},m+1}(X_1, \dots, X_{m+1}) = \mathrm{Res}_X \bigl( f_{\mathcal{E},m}(X_1, \dots, X_{m-1}, X), \; f_{\mathcal{E},3}(X_m, X_{m+1}, X) \bigr),$$

we shall prove this lemma by induction on $m$. Let $f_{\mathcal{E},m}(X_1, \dots, X_{m-1}, X) =
a_{2^{m-2}} X^{2^{m-2}} + \cdots + a_1 X + a_0$ and $f_{\mathcal{E},3}(X_m, X_{m+1}, X) = b_2 X^2 + b_1 X + b_0$. By the
premise that $f_{\mathcal{E},3}$ is of homogeneous parity, $b_0$ and $b_2$ must consist only of monomials
(in $X_m$ and $X_{m+1}$) of even parity. Furthermore, $b_1 = c X_m X_{m+1}$ for some constant $c$.
This is because $f_{\mathcal{E},3}$ is of degree 2 in each variable, for which the only monomial of
odd parity is $X_m X_{m+1} X$.

Now consider a term $c_k X_{m+1}^k$ of

$$f_{\mathcal{E},m+1}(X_1, \dots, X_m, X_{m+1}) = c_{2^{m-1}} X_{m+1}^{2^{m-1}} + \cdots + c_1 X_{m+1} + c_0$$

as a univariate polynomial in $X_{m+1}$. Again as $f_{\mathcal{E},3}$ is of degree 2 in $X$, we have the
case of $n = 2$ in Eq. (2.3). Now $X_{m+1}$ must come from $b_1$, so we can conclude that

$$c_k X_{m+1}^k = \sum_i \alpha_i \, a_{\beta_i} a_{\gamma_i} \, b_0^{\delta_i} b_2^{\epsilon_i} \, X_m^k X_{m+1}^k,$$

where $\alpha_i$ is a constant, $\beta_i, \gamma_i \in \{0, \dots, 2^{m-2}\}$, and $\delta_i$, $\epsilon_i$ are nonnegative integers such
that $\delta_i + \epsilon_i + k = 2^{m-2}$. We will complete the proof by showing that $c_k X_{m+1}^k$ is a
polynomial in $X_1, \dots, X_{m+1}$ of homogeneous parity for all $k$ as follows.


1. If $k$ is even, then by Lemma 2.1, $\beta_i$ and $\gamma_i$ are both even or both odd in each
   summand. In either case, the product $a_{\beta_i} a_{\gamma_i}$ is a polynomial in $X_1, \dots, X_{m-1}$ of
   even parity. It follows that each summand is a polynomial of even parity because
   it is a product of polynomials of even parity. Hence, $c_k X_{m+1}^k$ is a polynomial of
   even parity.

2. If $k$ is odd, the situation is similar but slightly more complicated. By Lemma 2.1,
   exactly one of $\beta_i$ and $\gamma_i$ is odd in each summand, say $\beta_i$. By the induction hypothesis,
   $a_{\beta_i}$ is a polynomial in $X_1, \dots, X_{m-1}$ of odd parity because it comes from $a_{\beta_i} X^{\beta_i}$
   in $f_{\mathcal{E},m}$. It follows that each summand is a polynomial of odd parity because it is
   a product of a polynomial of even parity $a_{\gamma_i} b_0^{\delta_i} b_2^{\epsilon_i}$ and a polynomial of odd parity
   $a_{\beta_i} X_m^k X_{m+1}^k$. Hence, $c_k X_{m+1}^k$ is a polynomial of odd parity.


By Lemma 2.2, $g_{\mathcal{E},m}(X_1, \dots, X_m) = f_{\mathcal{E},m+1}(X_1, \dots, X_m, x)$ is of homogeneous
parity. Obviously, the monomials of even parity will remain of even degree after $x$ is
substituted. If $m$ is even, then the monomials of odd parity in $f_{\mathcal{E},m+1}$ will become of
even degree after $x$ is substituted because an even number of odd numbers sum to an
even number. Similarly, if $m$ is odd, then the monomials of odd parity in $f_{\mathcal{E},m+1}$ will
become of odd degree after $x$ is substituted. However, those odd-degree monomials
that are not of homogeneous parity, e.g., $X_1^2 X_2$, cannot appear in $g_{\mathcal{E},m}$ by Lemma 2.2.
This completes the proof of Theorem 2.1.
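
The last step of the argument is purely combinatorial and can be illustrated on exponent tuples alone. The sketch below enumerates all monomials of homogeneous parity in $m + 1$ variables with exponents bounded by a small hypothetical cap (`max_exp`, chosen only to keep the enumeration tiny), drops the last variable as if a constant had been substituted for it, and collects the total degrees that remain.

```python
from itertools import product


def degrees_after_substitution(m, max_exp=2):
    """Total degrees left after fixing the last of m+1 variables.

    Only monomials of homogeneous parity (all exponents even, or all odd)
    are considered, mirroring the structure guaranteed by Lemma 2.2.
    """
    degrees = set()
    for exps in product(range(max_exp + 1), repeat=m + 1):
        if len({e % 2 for e in exps}) != 1:
            continue  # mixed parity: excluded by Lemma 2.2
        degrees.add(sum(exps[:m]))  # degree in the remaining m variables
    return degrees


print(degrees_after_substitution(2))  # m even
print(degrees_after_substitution(3))  # m odd
```

For even $m$ only even degrees survive, whereas for odd $m$ an odd degree appears but only through the all-odd exponent pattern, which matches the two cases of Theorem 2.1.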


2.1.5.3 What Price for a Highly Symmetric Factor Base?


Last but not least, we discuss the price to be paid for a highly symmetric
factor base $\mathcal{F}$ that is invariant under more group actions in addition to that of the
symmetric group $S_m$. As previewed in Sect. 2.1.2.6, we would expect that the effect
of the decrease in decomposition probability due to additional symmetry in $\mathcal{F}$ could
be offset by that of the increase in the number of solutions. For example, let us reconsider
the group action of addition of $T_2$ in Sect. 2.1.2.4. If we could get $2^{m-1}$ solutions, then
the loss of the factor of $2^{m-1}$ in decomposition probability would be compensated.
This way everything would be the same as if there were no such symmetry, and we
could exploit the additional symmetry at no cost.

Unfortunately, this proposition is false in general. Consider an example of $m = 4$.
Let $Q_i = P_i + T_2$ for $i = 1, 2, 3, 4$. We can write down all $2^{m-1} = 8$ possible ways
of a point decomposition under this group action:


$$
\begin{aligned}
P_1 + P_2 + P_3 + P_4 &= Q_1 + Q_2 + P_3 + P_4 \\
&= Q_1 + P_2 + Q_3 + P_4 = Q_1 + P_2 + P_3 + Q_4 \\
&= P_1 + Q_2 + Q_3 + P_4 = P_1 + Q_2 + P_3 + Q_4 \\
&= P_1 + P_2 + Q_3 + Q_4 = Q_1 + Q_2 + Q_3 + Q_4.
\end{aligned}
$$

It is easy to find that we have only five linearly independent relations from these
eight relations, as there are nontrivial linear combinations summing to zero, e.g.:

$$(P_1 + P_2 + P_3 + P_4) - (Q_1 + Q_2 + P_3 + P_4) - (P_1 + P_2 + Q_3 + Q_4) + (Q_1 + Q_2 + Q_3 + Q_4) = 0.$$


As explained in Sect. 2.1.4.1, the factor bases for Montgomery and twisted Edwards
curves are invariant under addition of 2-torsion points. For $m = 3$, we achieve the
maximum rank of $2^{m-1} = 4$. For $m = 4$, as we have explained above, we can only have
rank 5, which is strictly less than the maximum possible rank $2^{m-1} = 8$.
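
The ranks 4 and 5 can be reproduced with elementary linear algebra. In the sketch below, each decomposition is encoded as a 0/1 vector over the formal symbols $P_1, \dots, P_m, Q_1, \dots, Q_m$, where an even number of the $P_i$ are replaced by $Q_i = P_i + T_2$; the rank is then computed over the rationals. The encoding and the function names are illustrative and are not taken from the experiments.

```python
from fractions import Fraction
from itertools import combinations


def relation_vectors(m):
    """One 0/1 vector per decomposition; coordinates are P_1..P_m, Q_1..Q_m."""
    vectors = []
    for size in range(0, m + 1, 2):  # an even number of P_i become Q_i
        for flipped in combinations(range(m), size):
            v = [0] * (2 * m)
            for i in range(m):
                v[m + i if i in flipped else i] = 1
            vectors.append(v)
    return vectors


def rank_over_q(rows):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r


for m in (3, 4):
    vecs = relation_vectors(m)
    print(m, len(vecs), rank_over_q(vecs))
```

For $m = 3$ the four relations are independent, while for $m = 4$ the eight relations span a space of dimension only five, in agreement with the Rank column of Fig. 2.1 for Montgomery and twisted Edwards curves.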

Finally, we note that we have not exploited any symmetry for Hessian curves in our
experiments. However, the rank for Hessian curves is always 1 in all our experiments.
This shows that the factor base we have chosen for Hessian curves is not invariant
under addition of small torsion points, as the rank would be greater than 1 otherwise.


2.1.6 Concluding Remarks 


In this section, we experimentally explored the index-calculus attack on ECDLP over
different curve forms, namely twisted Edwards, Montgomery, Hessian, and Weierstrass
curves, under fair conditions in the sense that the curves are isomorphic to each
other over the same field of definition $\mathbb{F}_{p^n}$, and showed that PDP solving on twisted
Edwards curves is clearly faster than on the others. We investigated the summation
polynomials of all forms in detail, found that big differences exist in the number of
terms, and proved that the PDP polynomials on twisted Edwards curves contain no
monomials of odd degree when $m$ is even and only some of them when $m$ is odd.
We showed that this difference leads to a shorter solving time for the index-calculus
attack on ECDLP over twisted Edwards curves than over the other forms.


2.2 Analysis on Ring-LWE over Decomposition Fields 
2.2.1 Introduction 


The ring variant of learning with errors (Ring-LWE) based cryptography [15, 16] is 
one of the most attractive research areas in cryptography. Ring-LWE has provided 
efficient and provably secure post-quantum cryptographic protocols, which include 
homomorphic encryption (HE) schemes [4, 5, 9]. The development of the efficiency 
and security of both post-quantum cryptography and HE is strongly desirable. In 
fact, the standardization of post-quantum cryptography is under development by the 
National Institute of Standards and Technology. Moreover, HE schemes that enable 
us to execute the computation on encrypted data without decryption have many 
applications in cloud computing. 

Ring-LWE is characterized by two probabilistic distributions, modulus param- 
eters (integers) and number fields, as detailed in Sect. 2.2.2.4. Usually, cyclotomic 
fields are used as the underlying number fields to increase efficiency and security 
[17]. However, especially in the case of HE schemes, improving the efficiency of 
the encryption/decryption procedures and homomorphic arithmetic operations on 
encrypted data while ensuring security remain important tasks. 

To construct an HE scheme that can simultaneously encrypt many plaintexts 
efficiently, Arita and Handa proposed the use of a decomposition field, which is 
contained in a cyclotomic field with prime conductors, as an underlying number 
field for Ring-LWE [1]. (Sect. 2.2.3 presents the details of decomposition fields and 
of Arita and Handa’s idea.) Arita and Handa’s HE scheme, which is called the subring 
HE scheme, is indistinguishably secure under a chosen-plaintext attack if the decision 
variant of Ring-LWE over the decomposition fields is computationally infeasible. 
Arita and Handa’s experiments [1, Sect. 5] showed that the performance of the subring
HE scheme is much better than that of the FV scheme based on Ring-LWE over $\ell$th
cyclotomic fields with prime numbers $\ell$, as implemented in HElib [11].

As for the security of the subring HE scheme, Arita and Handa remarked that in 
the case of decomposition fields, some of the security properties of Ring-LWE in the 
case of cyclotomic fields are also satisfied. More concretely, there exists a quantum 
polynomial-time reduction from the approximate shortest vector problem on certain 
ideal lattices to Ring-LWE over decomposition fields, and the equivalence between 
the decision and search variants of Ring-LWE over decomposition fields is satisfied. 

However, solving Ring-LWE is reduced to solving certain problems on lattices,
such as the closest vector problem (CVP) and the shortest vector problem, and the
difficulty of problems on lattices depends heavily on the structure and given bases
of the underlying lattices. For example, if the shortest vector is much shorter than
the second shortest vector in a certain lattice $\mathcal{L}$, then the shortest vector problem
for lattice $\mathcal{L}$ would be easy. This means that the underlying number fields affect the
difficulty of lattice problems arising in Ring-LWE. Hence, to ensure the security
of the subring HE scheme, experimental or theoretical analyses of (lattice) attacks
should be performed. However, [1] does not provide any such analysis.

In this study, we provide an experimental analysis of the security of Ring-LWE
over decomposition fields. More precisely, we compare the security of Ring-LWE
over decomposition fields and of Ring-LWE over the $\ell$th cyclotomic fields with
some prime numbers $\ell$. In our experiments, we reduce the search Ring-LWE to the
(approximate) CVP on certain lattices in the same way as Bonnoron et al.’s analysis
[3] because the target of Bonnoron et al.’s analysis is Ring-LWE optimized for HE.
We use Babai’s nearest plane algorithm [2] and Kannan’s embedding technique [12]
to solve the CVP. We then compare the running times, success rates, and root
Hermite factors. (The root Hermite factor [10] is usually used to evaluate the quality of
lattice attacks.) We also compare the experimental results of lattice attacks against
Ring-LWE over various decomposition fields to find those fields that provide weak
Ring-LWE.

Our experimental results indicate that the success rates and root Hermite factors
for the decomposition fields are almost the same as those for the cyclotomic fields.
However, the running time for decomposition fields is longer than that for cyclotomic 
fields. Moreover, the difference in running time increases as the rank of the lattices 
increases. 

Therefore, we believe that Ring-LWE over decomposition fields is more secure 
against the above lattice attacks than that over cyclotomic fields because the ranks 
of the lattices occurring in our experiments are much lower than the ranks of the 
lattices used in practice. This means that to construct HE schemes (or schemes of 
other types), fewer parameters are needed for Ring-LWE over decomposition fields 
than for Ring-LWE over cyclotomic fields. Therefore, as a result of our analysis, 
we believe that Ring-LWE over decomposition fields can be used to construct more 
efficient HE schemes. 


2.2.2 Preliminaries 


In this section, we briefly review the notation of lattices, Galois theory, number fields,
and Ring-LWE. Throughout this study, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, and $\mathbb{C}$ denote the ring of (rational)
integers, field of rational numbers, field of real numbers, and field of complex numbers,
respectively. For a positive integer $m \in \mathbb{Z}$, we suppose that any element of
$\mathbb{Z}/m\mathbb{Z}$ is represented by an integer contained in the interval $(-m/2, m/2] \cap \mathbb{Z}$.


2.2.2.1 Lattices


An $m$-dimensional lattice is defined as a discrete additive subgroup of $\mathbb{R}^m$. It is
well known that for any lattice $\mathcal{L} \subset \mathbb{R}^m$, there exist $\mathbb{R}$-linearly independent vectors
$b_1, \dots, b_n \in \mathbb{R}^m$ such that $\mathcal{L} = \bigoplus_{1 \le i \le n} \mathbb{Z} b_i := \{ \sum_{1 \le i \le n} a_i b_i \mid a_i \in \mathbb{Z} \}$.
In other words, for a matrix $B = (b_1, \dots, b_n)$ whose $i$th column vector is $b_i$, we
have $\mathcal{L} = \{ Bx \mid x \in \mathbb{Z}^n \}$. Then, we say that $\{b_1, \dots, b_n\}$ is a lattice basis of $\mathcal{L}$, and
$B$ is the basis matrix of $\mathcal{L}$ with respect to $\{b_1, \dots, b_n\}$. The value $n$ is called the
rank of $\mathcal{L}$, and it is denoted by $\mathrm{rank}(\mathcal{L})$. There are infinitely many bases for a lattice.
In fact, for any unimodular matrix $U$, the column vectors of $BU$ also form a basis of $\mathcal{L}$.
An important invariant of $\mathcal{L}$ is the determinant defined as $\det(\mathcal{L}) := \sqrt{\det(B^{\top} B)}$. This
determinant is independent of the choice of basis.
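
The basis independence of the determinant can be checked on a small example. The sketch below assumes the column convention used in this section (basis vectors are the columns of $B$), so that a change of basis is the right multiplication $BU$ by a unimodular matrix and the squared determinant is the determinant of the Gram matrix $B^{\top} B$. The example matrices and helper names are illustrative.

```python
def transpose(M):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def det(M):
    """Determinant by Laplace expansion along the first row (small matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j, entry in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def squared_lattice_det(B):
    """det(L)^2 computed as the determinant of the Gram matrix B^T B."""
    return det(matmul(transpose(B), B))


B = [[2, 1],
     [0, 3]]          # columns b_1 = (2, 0), b_2 = (1, 3)
U = [[1, 1],
     [0, 1]]          # unimodular: integer entries, determinant 1
print(squared_lattice_det(B), squared_lattice_det(matmul(B, U)))
```

Both bases give the same value, so $\det(\mathcal{L})$ depends only on the lattice. Working with the squared determinant keeps the whole computation in exact integer arithmetic.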

There are various computationally hard problems on lattices. Here, we explain
the CVP, which is a well-known problem on lattices. Given a lattice $\mathcal{L}$ and a target
vector $t \in \mathbb{R}^m \setminus \mathcal{L}$, the CVP on $(\mathcal{L}, t)$ is the problem of finding a vector $x \in \mathcal{L}$ such
that for all vectors $y \in \mathcal{L}$, we have $\|t - x\| \le \|t - y\|$. For a real number $\gamma \ge 1$, the
approximate CVP on $(\mathcal{L}, t, \gamma)$ is the problem of finding a vector $x \in \mathcal{L}$ such that
for all vectors $y \in \mathcal{L}$, we have $\|t - x\| \le \gamma \|t - y\|$. Babai’s nearest plane algorithm
and Kannan’s embedding technique are basic algorithms for solving the approximate
CVP. Almost all known problems on lattices that are useful for constructing
cryptographic protocols become more difficult as the ranks of the underlying lattices
increase, and the quality of the two algorithms mentioned earlier depends on the ranks
of the input lattices.

Breaking some cryptographic protocols can be reduced to solving certain com- 
putational problems on lattices, including the (approximate) CVP [3, 8]. To solve 
such problems on lattices, we usually use lattice basis reduction algorithms, which 
transform a given basis of a lattice into a basis of the same lattice that consists of 
nearly orthogonal and relatively short vectors. In fact, an input of Babai’s nearest 
plane algorithm is an (LLL) reduced basis, and Kannan’s embedding technique out- 
puts an appropriate vector from the reduced basis. In our experiments, to solve CVP 
using Babai’s nearest plane algorithm and Kannan’s embedding technique, we use 
the LLL algorithm [13] and BKZ algorithm [7, 19], which are well-known algorithms 
for computing such bases. 
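
To make the role of the reduced basis concrete, the following is a minimal sketch of Babai’s nearest plane algorithm in exact rational arithmetic. It assumes that the input basis (given here as a list of row vectors) is already reduced; no LLL or BKZ step is included, and the tiny two-dimensional example is hypothetical rather than taken from the experiments.

```python
from fractions import Fraction


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Gram-Schmidt orthogonalization (without normalization)."""
    ortho = []
    for b in basis:
        v = [Fraction(x) for x in b]
        for u in ortho:
            mu = dot(b, u) / dot(u, u)
            v = [vi - mu * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return a lattice vector close to `target` (approximate CVP)."""
    ortho = gram_schmidt(basis)
    residual = [Fraction(x) for x in target]
    closest = [0] * len(target)
    # Process the basis vectors from last to first.
    for b, b_star in zip(reversed(basis), reversed(ortho)):
        c = round(dot(residual, b_star) / dot(b_star, b_star))
        residual = [r - c * x for r, x in zip(residual, b)]
        closest = [v + c * x for v, x in zip(closest, b)]
    return closest


basis = [[2, 0], [1, 3]]
target = [Fraction(13, 5), Fraction(29, 10)]
print(babai_nearest_plane(basis, target))
```

The quality of the returned vector depends on how orthogonal the basis is, which is why a reduction step normally precedes the call in practice.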

The quality of basis reduction algorithms is usually estimated by the root Hermite
factor, which is defined as follows: Let $b$ be the shortest vector of a basis of a lattice
$\mathcal{L}$ with rank $n$, which has been reduced by a basis reduction algorithm $A$. Then, the
root Hermite factor $\delta_{A,\mathcal{L}}$ is defined as the constant satisfying $\delta_{A,\mathcal{L}}^{n} := \|b\| / \det(\mathcal{L})^{1/n}$.
Better basis reduction algorithms provide smaller root Hermite factors.
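
As a quick numerical illustration of this definition, the helper below solves $\delta^n = \|b\| / \det(\mathcal{L})^{1/n}$ for $\delta$ in floating-point arithmetic. The function name and the sample inputs are illustrative.

```python
import math


def root_hermite_factor(shortest_norm, lattice_det, rank):
    """Solve delta**rank == shortest_norm / lattice_det**(1/rank) for delta."""
    return (shortest_norm / lattice_det ** (1.0 / rank)) ** (1.0 / rank)


# Rank 2, det(L) = 4: a basis vector of norm 2 gives delta = 1,
# whereas a longer vector of norm 4 gives delta = sqrt(2).
print(root_hermite_factor(2.0, 4.0, 2))
print(root_hermite_factor(4.0, 4.0, 2))
```

A shorter first vector for the same lattice yields a smaller value, which is the sense in which the factor measures the quality of a reduction.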


2.2.2.2 Galois Theory 


To describe decomposition fields, we need to review Galois theory.
Let $K$ be a field and $L$ an extension field of $K$; we denote this situation by $L/K$.
The field $L$ is a $K$-vector space, and the degree of extension of $L/K$, denoted by
$[L : K]$, is defined as the dimension of $L$ as a $K$-vector space. If $M$ is a subfield of $L$
containing $K$ as a subfield, i.e., $K \subset M \subset L$, then we call $M$ an intermediate field
of $L/K$. If $L/K$ satisfies $[L : K] < \infty$, then $L/K$ is called a finite extension of $K$.
If $M$ is an intermediate field of $L/K$ with $[L : K] < \infty$, then we have $[L : K] =
[L : M][M : K]$. If for any $\alpha \in L$, there exists a nonzero polynomial $f(x) \in K[x]$
such that $f(\alpha) = 0$, then $L/K$ is called an algebraic extension of $K$. It is known that
all finite extensions are algebraic extensions.

From now on, we suppose that $L/K$ is a finite algebraic extension. For any
$\alpha \in L$, the minimal polynomial over $K$ of $\alpha$ is defined as the monic polynomial
$f(x) \in K[x]$ with the lowest degree of all polynomials in $K[x]$ that vanish at $\alpha$. We
denote $\mathrm{Irr}(\alpha, K)(x)$ as the minimal polynomial over $K$ of $\alpha$. Note that the minimal
polynomial over $K$ of $\alpha$ coincides with the monic irreducible polynomial over $K$
that vanishes at $\alpha$. For a subset $S \subset L$, we denote $K(S)$ as the smallest subfield of $L$
among subfields containing $K$ and $S$. We call $K(S)$ the field generated by $S$ over $K$.
If $L$ is generated by one element $\theta \in L$ over $K$, i.e., $L = K(\theta)$, then we have an
isomorphism $L \cong K[x]/(\mathrm{Irr}(\theta, K)(x))$ by $\theta \mapsto x \pmod{\mathrm{Irr}(\theta, K)(x)}$. This implies
that $[K(\theta) : K] = \deg \mathrm{Irr}(\theta, K)$.

Next, we describe separable, normal, and Galois extensions of fields. If
$\mathrm{Irr}(\alpha, K)(x)$ has no multiple roots for any $\alpha \in L$, then $L/K$ is called a separable
extension of $K$. If $L$ contains all roots of $\mathrm{Irr}(\alpha, K)(x)$ for any $\alpha \in L$, then $L/K$ is
called a normal extension of $K$. If all algebraic extensions of $K$, including infinite
algebraic extensions, are separable, then $K$ is called a perfect (field). It is known
that fields with characteristic zero and any finite field are perfect, and that any finite
separable extension field can be generated by one element. If $L/K$ is a separable
and normal extension of $K$, then $L/K$ is called a Galois extension of $K$. Let $\Omega$ be
a sufficiently large field containing $K$ such that any ring homomorphism $\varphi$ from $L$
that fixes $K$, i.e., $\varphi(a) = a$ for any $a \in K$, satisfies $\varphi(L) \subset \Omega$. We define the set of all
ring homomorphisms from $L$ to $\Omega$ fixing $K$ as follows:


$$\mathrm{Hom}_K(L, \Omega) := \{ \sigma : L \to \Omega \mid \sigma(a) = a, \ \forall a \in K \}.$$


(Note that any nonzero ring homomorphism between fields is injective.) Let $L/K$
be separable with $[L : K] = n$ and $L = K(\theta)$. Let $\theta = \theta_1, \dots, \theta_n$ be all roots
of $\mathrm{Irr}(\theta, K)(x)$. For any $\sigma \in \mathrm{Hom}_K(L, \Omega)$, we have $\sigma(\mathrm{Irr}(\theta, K)(\theta)) = \mathrm{Irr}(\theta, K)(\sigma(\theta)) = 0$.
This means that $\sigma(\theta) = \theta_i$ for some $i = 1, \dots, n$. This then implies
$\# \mathrm{Hom}_K(L, \Omega) = n$. (Any $\tau \in \mathrm{Hom}_K(L, \Omega)$ is completely determined by the image of
$\theta$ under $\tau$ because $\tau$ fixes $K$.)

Moreover, if $L/K$ is normal, then $\sigma$ induces an isomorphism $L \cong L$. Note
that $L = K(\theta) \cong K(\theta_i)$ for any $i = 1, \dots, n$ because these fields are isomorphic
to $K[X]/(\mathrm{Irr}(\theta, K))$. Therefore, we may take $L$ as $\Omega$ and can write $\mathrm{Aut}_K(L) =
\mathrm{Hom}_K(L, \Omega)$.

Now, we can describe the fundamental theorem of Galois theory (for finite
field extensions). Let $L/K$ be a finite Galois extension of $K$. Then, we write
$\mathrm{Gal}(L/K) = \mathrm{Aut}_K(L)$. For any subgroup $H \subset \mathrm{Gal}(L/K)$ and an intermediate field
$M$ of $L/K$, we define

$$L^H := \{ a \in L \mid \sigma(a) = a, \ \forall \sigma \in H \},$$
$$G_M := \{ \sigma \in \mathrm{Gal}(L/K) \mid \sigma(a) = a, \ \forall a \in M \}.$$


We note that $L/M$ is a Galois extension with $\mathrm{Gal}(L/M) = G_M$. It is not difficult
to see that $L^H$ is an intermediate field of $L/K$ and that $G_M$ is a subgroup
of $\mathrm{Gal}(L/K)$. We can define two maps with respect to $L/K$. One is a map $\Phi$
from $A := \{ M \subset L \mid M \text{ is an intermediate field of } L/K \}$ to $B := \{ H \subset \mathrm{Gal}(L/K) \mid
H \text{ is a subgroup of } \mathrm{Gal}(L/K) \}$ by $M \mapsto G_M$. The other is a map $\Psi$ from $B$ to $A$ by
$H \mapsto L^H$. The fundamental theorem of Galois theory is as follows:


Theorem 2.2 Let $L/K$, $A$, $B$, $\Phi$, and $\Psi$ be as above. Then, the following statements
are true:


(1) There is a one-to-one correspondence between $A$ and $B$. More precisely, $\Phi$ and
    $\Psi$ are inverse maps of each other.

(2) If $M_1$ and $M_2$ are intermediate fields of $L/K$ with $M_1 \subset M_2$, then we have
    $\Phi(M_2) \subset \Phi(M_1)$. Similarly, if $H_1$ and $H_2$ are subgroups of $\mathrm{Gal}(L/K)$ with
    $H_1 \subset H_2$, then we have $\Psi(H_2) \subset \Psi(H_1)$.

(3) Let $M_1$, $M_2$, $H_1$, and $H_2$ be as in (2). Then, we have $(H_2 : H_1) = \#H_2 / \#H_1 =
    [\Psi(H_1) : \Psi(H_2)]$ and $[M_2 : M_1] = (\Phi(M_1) : \Phi(M_2))$.

(4) An intermediate field $M$ of $L/K$ is a Galois extension of $K$ if and only if
    $G_M = \Phi(M)$ is a normal subgroup of $\mathrm{Gal}(L/K)$. Moreover, if $G_M = \mathrm{Gal}(L/M)$ is a
    normal subgroup of $\mathrm{Gal}(L/K)$, then we have


$$\mathrm{Gal}(L/K) / \mathrm{Gal}(L/M) \cong \mathrm{Gal}(M/K).$$


In particular, if $\mathrm{Gal}(L/K)$ is an abelian group, then all intermediate fields of $L/K$ are
Galois extensions of $K$.


For a proof of Theorem 2.2, see [18] for example. (It is easy to prove (2) of
Theorem 2.2 from the definitions of $\Phi$ and $\Psi$.)


2.2.2.3 Number Fields 


To describe Ring-LWE and decomposition fields, which play central roles in this 
paper, we need some notations from algebraic number theory. 

An (algebraic) number field is a finite extension field of $\mathbb{Q}$. Let $K$ be a number field
with extension degree $[K : \mathbb{Q}] = n$. An element $a \in K$ is called an algebraic integer
if there exists a monic polynomial $f \in \mathbb{Z}[x]$ such that $f(a) = 0$. The ring of integers
$\mathcal{O}_K$ of $K$ is defined as the subring of $K$ consisting of all algebraic integers of $K$. The ring
$\mathcal{O}_K$ has an integral basis ($\mathbb{Z}$-basis) $\{u_1, \dots, u_n\}$, i.e., for any element $u \in \mathcal{O}_K$, there
exist integers $a_1, \dots, a_n$ such that $u$ is uniquely written as $u = \sum_{i=1}^{n} a_i u_i$. It is well
known that any (integral) ideal $I$ of $\mathcal{O}_K$ is uniquely factored into a product of
prime ideals, i.e., there exist prime ideals $P_1, \dots, P_m$ satisfying $I = P_1^{e_1} \cdots P_m^{e_m}$ for
$e_i \ge 1$. If $I = p\mathcal{O}_K$ for a prime number $p$ and $K$ is a Galois extension of $\mathbb{Q}$, then we
have $\mathcal{O}_K / P_i \cong \mathbb{F}_{p^d}$ for some $d \in \mathbb{N}$ and all $e_i$’s are mutually equal. Moreover, we
have $med = n$, where $e := e_i$, and if all $e_i$’s are equal to 1 (resp. all $e_i$’s and $d$ are
equal to 1), then we say that $p$ is unramified (resp. splits completely) in $K$. Any prime
ideal of $\mathcal{O}_K$ is a maximal ideal in $\mathcal{O}_K$, and thus we have $P_i + P_j = \mathcal{O}_K$ for any $i \ne j$.
This induces an isomorphism of rings $\mathcal{O}_K / P_1 \cdots P_m \cong \mathcal{O}_K / P_1 \times \cdots \times \mathcal{O}_K / P_m$.


2.2.2.4 Ring-LWE Problem 


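Let $K$ and $\mathcal{O}_K$ be as above. Let $\chi_{\mathrm{secret}}$ and $\chi_{\mathrm{error}}$ be probabilistic distributions on $\mathcal{O}_K$
and let $p$ be an integer. We denote by $\mathcal{O}_{K,p}$ the residue ring $\mathcal{O}_K / p\mathcal{O}_K$. For a
probabilistic distribution $\chi$ on a set $X$, we write $a \leftarrow \chi$ when $a \in X$ is chosen according
to $\chi$. We denote $U(X)$ as the uniform distribution on $X$. The Ring-LWE distribution
on $\mathcal{O}_{K,p} \times \mathcal{O}_{K,p}$, denoted by $\mathrm{RLWE}_{K,p,\chi_{\mathrm{error}},\chi_{\mathrm{secret}}}$, is defined as a probabilistic distribution that
takes elements of the form $(a, as + e)$ with $a \leftarrow U(\mathcal{O}_{K,p})$, $s \leftarrow \chi_{\mathrm{secret}}$, and
$e \leftarrow \chi_{\mathrm{error}}$. The Ring-LWE problem has two variants. One is the problem of
distinguishing $\mathrm{RLWE}_{K,p,\chi_{\mathrm{error}},\chi_{\mathrm{secret}}}$ from $U(\mathcal{O}_{K,p} \times \mathcal{O}_{K,p})$, which is called the decision
Ring-LWE problem. The other is the problem of finding $s \in \mathcal{O}_{K,p}$, given arbitrarily
many samples $(a_i, a_i s + e_i) \in \mathcal{O}_{K,p} \times \mathcal{O}_{K,p}$ chosen according to $\mathrm{RLWE}_{K,p,\chi_{\mathrm{error}},\chi_{\mathrm{secret}}}$,
which is called the search Ring-LWE problem.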
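
The shape of a Ring-LWE sample can be illustrated with a toy sampler. The sketch below makes two simplifying assumptions that are not part of the definition above: the number field is a power-of-two cyclotomic field, so that $\mathcal{O}_K \cong \mathbb{Z}[x]/(x^n + 1)$, and both the secret and the error have coefficients drawn uniformly from $\{-1, 0, 1\}$ as a stand-in for $\chi_{\mathrm{secret}}$ and $\chi_{\mathrm{error}}$. Residues are represented in $(-p/2, p/2]$ as in Sect. 2.2.2, and all names are illustrative.

```python
import random


def centered(x, p):
    """Representative of x mod p in the interval (-p/2, p/2]."""
    r = x % p
    return r - p if r > p // 2 else r


def negacyclic_mul(f, g, p):
    """Multiply f and g in Z_p[x]/(x^n + 1); polynomials are coefficient lists."""
    n = len(f)
    out = [0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if i + j < n:
                out[i + j] += fi * gj
            else:
                out[i + j - n] -= fi * gj  # x^n = -1
    return [centered(c, p) for c in out]


def rlwe_sample(s, p, rng):
    """Return (a, b, e) with b = a*s + e; e is returned only for checking."""
    n = len(s)
    a = [centered(rng.randrange(p), p) for _ in range(n)]
    e = [rng.choice((-1, 0, 1)) for _ in range(n)]  # toy error distribution
    b = [centered(x + y, p) for x, y in zip(negacyclic_mul(a, s, p), e)]
    return a, b, e


rng = random.Random(0)
secret = [1, 0, -1, 1]
a, b, e = rlwe_sample(secret, 17, rng)
print(a, b)
```

The search problem asks for `secret` given many pairs `(a, b)`; the decision problem asks to tell such pairs apart from pairs of uniformly random elements.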

The Ring-LWE problem is expected to be computationally difficult even with 
quantum computers. It is proved that the decision Ring-LWE problem is equivalent 
to the search problem if K is a cyclotomic field and if p is a prime number and 
(almost) splits completely in K [16]. In addition, this equivalence is generalized to 
the cases where K/Q is a Galois extension and where p is unramified in K [6]. 
Moreover, there is a quantum polynomial-time reduction from the approximate
shortest vector problem on certain ideal lattices to the search Ring-LWE.


2.2.3 Ring-LWE over Cyclotomic and Decomposition Fields 


In this section, we describe why Arita and Handa proposed the use of decomposi- 
tion fields as the underlying number fields of Ring-LWE to construct efficient HE 
schemes. 


2.2.3.1 Cyclotomic Fields and Decomposition Fields 


First, we briefly review cyclotomic fields. For a positive integer $m$, let $\zeta_m \in \mathbb{C}$ be
a primitive $m$th root of unity and $n = \varphi(m)$, where $\varphi(\cdot)$ denotes Euler’s totient
function. Then, $K := \mathbb{Q}(\zeta_m)$ is called the $m$th cyclotomic field. The ring of integers
of $K$ coincides with $R := \mathbb{Z}[\zeta_m]$. Any prime number $p$ that does not divide $m$ is
unramified in $K$, and if $p \equiv 1 \pmod{m}$, then $p$ splits completely in $K$. Here, $K/\mathbb{Q}$
is a Galois extension of degree $[K : \mathbb{Q}] = n$, and its Galois group $\mathrm{Gal}(K/\mathbb{Q})$ is
isomorphic to $(\mathbb{Z}/m\mathbb{Z})^{\times}$.

Next, we describe the decomposition fields of number fields. Let $L$ be a number
field, and suppose that $L/\mathbb{Q}$ is a Galois extension and that its Galois group $G :=
\mathrm{Gal}(L/\mathbb{Q})$ is a cyclic group. Let $p$ be a prime number that is unramified in $L$ and
satisfies $pO_L = P_1 \cdots P_g$, where the $P_i$’s are the prime ideals of $O_L$. Let $G_Z$ be
the subgroup of $G$ that consists of all elements $\sigma$ fixing all $P_i$, i.e., $\sigma(P_i) = P_i$ for
$1 \le i \le g$, and let $Z$ be the fixed field of $G_Z$. Then, we call $Z$ the decomposition field
with respect to $p$. The field $Z$ is a number field and the ring of integers of $Z$ is $O_Z =
O_L \cap Z$. Suppose $\mathfrak{p}_i := O_Z \cap P_i$. Then, we have $pO_Z = \mathfrak{p}_1 \cdots \mathfrak{p}_g$. A generator $\sigma$
of $G_Z$ acts on $O_L/P_i \cong \mathbb{F}_{p^d}$ as the $p$th Frobenius map, i.e., $\sigma(x) \equiv x^p \pmod{P_i}$
for all $x \in O_L$ and for $1 \le i \le g$. Therefore, we have $O_Z/\mathfrak{p}_i \cong \mathbb{F}_p$ and $[Z : \mathbb{Q}] = g$,
i.e., $p$ splits completely in $Z$.
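
The degree $g = [Z : \mathbb{Q}]$ can be computed directly in the special case used later in this section. The sketch below assumes that $L$ is the $\ell$th cyclotomic field with $\ell$ an odd prime different from $p$, so that $G \cong (\mathbb{Z}/\ell\mathbb{Z})^{\times}$ and the Frobenius element corresponds to $p \bmod \ell$. Under this assumption $d$ is the multiplicative order of $p$ modulo $\ell$ and $g = (\ell - 1)/d$. The function names are hypothetical.

```python
def multiplicative_order(p, ell):
    """Smallest d >= 1 with p^d = 1 (mod ell); requires gcd(p, ell) = 1."""
    d, x = 1, p % ell
    while x != 1:
        x = (x * p) % ell
        d += 1
    return d

def decomposition_field_degree(ell, p):
    """Return (g, d) for the ell-th cyclotomic field, ell prime, p != ell.

    d is the residue degree (O_L/P_i has p^d elements) and g = (ell - 1) / d
    is the degree of the decomposition field Z over Q.
    """
    d = multiplicative_order(p, ell)
    assert (ell - 1) % d == 0
    return (ell - 1) // d, d

# ell = 2089 with p = 2 gives g = 72, one of the parameter pairs in Table 2.1.
g, d = decomposition_field_degree(2089, 2)
```

For $\ell = 59$ and $p = 2$ the same function returns $g = 1$ and $d = 58$: the prime 2 stays inert, which is the situation in which a decomposition field of a larger $\ell$ becomes attractive.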


2.2.3.2 Cyclotomic Fields Versus Decomposition Fields 


Let $K$, $L$, and $Z$ be as above and $p$ be a prime number that is unramified in $K$
and splits completely in $Z$. Assume that $L$ is the $\ell$th cyclotomic field with a prime
number $\ell$. As we mentioned in Sect. 2.2.1, cyclotomic fields are usually used as the
underlying number fields of Ring-LWE. From the viewpoint of the efficiency of
Ring-LWE-based schemes, there are good $\mathbb{Z}$-bases of the rings of integers of $K$ and
$Z$ [1, 17]. As for the security of Ring-LWE, in the cases of $K$ and $Z$, both the
equivalence and the reduction mentioned in Sect. 2.2.2.4 hold because both
$K/\mathbb{Q}$ and $Z/\mathbb{Q}$ are Galois extensions.

The main difference between $K$ and $Z$ is the algebraic structures of their rings of
integers modulo $p$. Because $p$ is unramified in $K$, we have $O_{K,p} \cong O_K/P_1 \times \cdots \times
O_K/P_k$ and $O_K/P_i \cong \mathbb{F}_{p^d}$ for $1 \le i \le k$ and for $d \ge 1$, where the $P_i$’s are prime
ideals in $O_K$ lying over $p$, i.e., $pO_K = P_1 \cdots P_k$. The FV scheme [9], which is an
HE scheme based on Ring-LWE, uses $O_{K,p}$ as its plaintext space, and thus, the FV
scheme (or any HE scheme with the same plaintext space) can encrypt and execute
several additions of $dk = n = [K : \mathbb{Q}]$ plaintexts in $\mathbb{F}_p$ simultaneously. However,
the FV scheme cannot execute the multiplication of the same number of plaintexts
in $\mathbb{F}_p$ simultaneously. To execute the multiplication of plaintexts in $\mathbb{F}_p$, we can only
use $\mathbb{F}_p \times \cdots \times \mathbb{F}_p$ (the direct product of $k$ finite fields) as the plaintext space.

In contrast, because $p$ splits completely in $Z$, we have $O_{Z,p} \cong O_Z/\mathfrak{p}_1 \times \cdots \times
O_Z/\mathfrak{p}_g$ and $O_Z/\mathfrak{p}_i \cong \mathbb{F}_p$ for any $1 \le i \le g$, where the $\mathfrak{p}_i$’s are prime ideals in $O_Z$
lying over $p$. This means that one can encrypt $g = [Z : \mathbb{Q}]$ plaintexts simultaneously.
Moreover, one can execute additions and multiplications of the same number of
plaintexts in $\mathbb{F}_p$ simultaneously. Because the extension degrees $g$ and $n$ are directly
related to the ranks of the lattices occurring in known lattice attacks, we should set
$g \approx n$ to compare the security of Ring-LWE over these fields. Therefore, the HE
scheme over $Z$ can encrypt and operate on $d$ times as many plaintexts as the FV scheme
over $K$ simultaneously.




Remark 2.1 1. If $p \equiv 1 \pmod{m}$, then $p$ splits completely in $K$ (recall that $K$ is
the $m$th cyclotomic field), and then there is no advantage to using decomposition
fields. However, for some cryptographic applications, we want to use a small $p$,
e.g., $p = 2$ [1]. Moreover, to avoid lattice attacks, the extension degree $[K : \mathbb{Q}]$
must be large, as we discussed above. Thus, we cannot expect $p \equiv 1 \pmod{m}$
for practical parameters in some applications.

2. By the Hensel lifting technique, for $r \ge 1$ and $q := p^r$, we have $O_{Z,q} \cong \mathbb{Z}/q\mathbb{Z} \times
\cdots \times \mathbb{Z}/q\mathbb{Z}$.


2.2.4 Our Experimental Analysis 


In this section, we present our experimental results on lattice attacks against Ring- 
LWE over decomposition fields and cyclotomic fields. First, we explain lattice attacks 
in our experiments. 


2.2.4.1 Lattice Attack in Our Experiments 


In our experiments, we reduce the search Ring-LWE to a CVP (or approximate CVP)
in the same way as Bonnoron et al.’s analysis [3] because the target of Bonnoron
et al.’s analysis is Ring-LWE optimized for HE. We describe this approach briefly
in the case of decomposition fields. Let $O_Z$ and $p$ be as in Sect. 2.2.3.1. Set $q := p^r$
for $r \ge 1$. Let $\{\mu_1, \ldots, \mu_g\}$ be a $\mathbb{Z}$-basis of $O_Z$, which is a good basis, as shown
in [1, Lemma 3]. We sample vectors $\mathbf{a} = (a_1, \ldots, a_g)$, $\mathbf{s} = (s_1, \ldots, s_g)$, and $\mathbf{e} =
(e_1, \ldots, e_g)$ from $U((\mathbb{Z}/q\mathbb{Z})^g)$, $D_{\mathbb{Z}^g,\sigma_s}$, and $D_{\mathbb{Z}^g,\sigma_e}$, respectively, where $D_{\mathbb{Z}^g,\sigma}$ denotes the
discrete Gaussian distribution with mean 0 and variance $\sigma^2$.


We put $a := \sum_{i=1}^{g} a_i \mu_i$, $s := \sum_{i=1}^{g} s_i \mu_i$, $e := \sum_{i=1}^{g} e_i \mu_i$, and $b := as +
e = \sum_{i=1}^{g} b_i \mu_i \pmod{q}$. Then, $(a, b)$ is a Ring-LWE instance over $Z$. Note that
to use Ring-LWE to construct HE schemes, the values $\sigma_s$ and $\sigma_e$ should be sufficiently
small because the $\ell_\infty$-norm $\|s\|_\infty$ directly affects the growth of noise after
multiplication. In our experiments, we set $\sigma_s = 1$ and $\sigma_e = 8$ according to [14]. By
comparing all coefficients of both sides, we get $A\mathbf{s} + \mathbf{e} = (b_1, \ldots, b_g)^t = \mathbf{b}$, where
$A$ is a matrix. (For any vector $\mathbf{v}$, $\mathbf{v}^t$ means its transpose.) If we set $A'$ as $(A \; I)$, then
we have $A'(\mathbf{s} \; \mathbf{e})^t = \mathbf{b} \pmod{q}$, where $I$ denotes the $g \times g$ identity matrix. From
the choice of $s_i$’s and $e_i$’s, our target vector $(\mathbf{s} \; \mathbf{e})^t$ is a very short vector from among
all solutions to $A'\mathbf{y} = \mathbf{b}$, and thus, we can expect that our target vector can be found
by solving the (approximate) CVP on the lattice $\mathcal{L} = \{\mathbf{x} \in \mathbb{Z}^{2g} \mid A'\mathbf{x} \equiv 0 \pmod{q}\}$
and on $\mathbf{w} := (0 \; \mathbf{b})^t$, which is a solution to $A'\mathbf{y} = \mathbf{b}$.
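
The linear-algebra step in this paragraph can be checked on a toy instance. The sketch below is illustrative only: the basis of $O_Z$ is not specified here, so the power basis of $\mathbb{Z}[x]/(x^g + 1)$ is used as an assumed stand-in to obtain a multiplication matrix $A$. It then verifies that $(\mathbf{s} \; \mathbf{e})^t$ and $\mathbf{w}$ both solve $A'\mathbf{y} = \mathbf{b}$ modulo $q$, so that their difference lies in the lattice. All names are hypothetical.

```python
def mult_matrix(a, q):
    """g x g matrix A with A @ s = coefficients of a*s in Z_q[x]/(x^g + 1) (assumed ring)."""
    g = len(a)
    A = [[0] * g for _ in range(g)]
    for i in range(g):          # row i: coefficient of x^i in the product
        for j in range(g):      # column j: contribution of s_j * x^j
            k = i - j
            A[i][j] = a[k] % q if k >= 0 else (-a[k + g]) % q
    return A

def mat_vec(M, v, q):
    """Matrix-vector product reduced modulo q."""
    return [sum(m * x for m, x in zip(row, v)) % q for row in M]

def build_instance(a, s, e, q):
    """Return (A', b, w) with A' = (A I), b = A s + e mod q and w = (0, b)."""
    g = len(a)
    A = mult_matrix(a, q)
    b = [(x + y) % q for x, y in zip(mat_vec(A, s, q), e)]
    A_prime = [A[i] + [1 if i == j else 0 for j in range(g)] for i in range(g)]
    w = [0] * g + b
    return A_prime, b, w

q = 17
a, s, e = [3, 1, 4, 1], [1, 0, -1, 1], [1, -1, 0, 2]
A_prime, b, w = build_instance(a, s, e, q)
target = s + e                                    # the short vector (s e)^t
diff = [wi - ti for wi, ti in zip(w, target)]     # w - (s e)^t, a lattice vector
```

Because `diff` lies in the lattice and `target` is short, `w` is close to the lattice point `diff`; this is why a CVP solver applied to `w` is expected to reveal the secret.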


We take
$$B = \begin{pmatrix} I & 0_{g,g} \\ -A & qI \end{pmatrix}$$
as a basis matrix of $\mathcal{L}$, where $0_{g,g}$ denotes the $g \times g$ zero matrix. We reduce the basis
matrix $B$ using the LLL and BKZ algorithms with block size $\beta = 10$. (In practice,
$\beta$ should be 10 or 20.) Let $B_{\mathrm{red}}$ be a reduced basis of $B$. We input $B_{\mathrm{red}}$ and $\mathbf{w}$ to
Babai’s nearest plane algorithm. The quality of the results of Babai’s nearest plane
algorithm depends on the quality of the basis reduction algorithms used to compute
the reduced input bases, and thus, we compute the root Hermite factor for $B_{\mathrm{red}}$.
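
A small sanity check of this construction is sketched below. It assumes the block form with $-A$ in the lower-left corner and basis vectors stored as columns, which is the choice consistent with the definition of $\mathcal{L}$ above. It also assumes the usual definition of the root Hermite factor, $\delta = (\|\mathbf{b}_1\| / \mathrm{vol}(\mathcal{L})^{1/n})^{1/n}$ for a lattice of rank $n$, with $\mathrm{vol}(\mathcal{L}) = q^g$ and $n = 2g$ here. The reduction itself (LLL, BKZ) is not reproduced, and the function names are hypothetical.

```python
import math

def basis_matrix(A, q):
    """Return B = [[I, 0], [-A, qI]] as a list of rows; its columns span the lattice."""
    g = len(A)
    top = [[1 if i == j else 0 for j in range(g)] + [0] * g for i in range(g)]
    bottom = [[-A[i][j] for j in range(g)] + [q if i == j else 0 for j in range(g)]
              for i in range(g)]
    return top + bottom

def in_lattice(A, x, q):
    """Check A'x = 0 (mod q) for A' = (A I) and x of length 2g."""
    g = len(A)
    return all((sum(A[i][j] * x[j] for j in range(g)) + x[g + i]) % q == 0
               for i in range(g))

def root_hermite_factor(b1_norm, q, g):
    """(||b_1|| / vol^(1/n))^(1/n) with n = 2g and vol = q^g, so vol^(1/n) = sqrt(q)."""
    n = 2 * g
    return (b1_norm / math.sqrt(q)) ** (1.0 / n)

A, q = [[1, 2], [3, 4]], 7
B = basis_matrix(A, q)
columns = list(zip(*B))          # each column is one basis vector
```

A smaller root Hermite factor means the first reduced basis vector is shorter relative to the lattice volume, i.e., the reduction was of higher quality.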

In contrast, Kannan’s embedding technique takes a basis matrix
$$C = \begin{pmatrix} B & -\mathbf{w} \\ 0 & M \end{pmatrix}$$
as input, and we set $M = 1$ according to the result of an experimental study on
Kannan’s embedding technique for LWE [20]. We also use the LLL and BKZ algorithms
with $\beta = 10$ to reduce the above basis matrix.


Remark 2.2 In the case of $\ell$th cyclotomic fields with prime numbers $\ell$, we use
$\{1, \zeta_\ell, \ldots, \zeta_\ell^{\ell-2}\}$ as a $\mathbb{Z}$-basis, which is also a good basis [17].


Remark 2.3 For $1 \le r' \le r$ and $q' := p^{r'}$, we can obtain samples of
$\mathrm{RLWE}_{K,q',\chi_{\mathrm{error}},\chi_{\mathrm{secret}}}$ from samples of $\mathrm{RLWE}_{K,q,\chi_{\mathrm{error}},\chi_{\mathrm{secret}}}$ by the natural projection $O_{Z,q} \to
O_{Z,q'}$ given by $a \mapsto a \pmod{q'}$. In our experiments, we use a small $r'$ to reduce running
times. In our experimental results, we only show $r'$.


2.2.4.2 Experimental Results 


We used a computer with 2.00 GHz CPUs (Intel(R) Xeon(R) CPU E7-4830 v4
(2.00 GHz) × 111) and 3 TB memory to conduct the experiments. The OS was Ubuntu
16.04.4. We implemented the code for sampling Ring-LWE instances in SageMath 
version 7.5.1. We also used Magma V2.23-1 to execute lattice attacks. We took 100 
samples and performed lattice attacks on them. 

We show our experimental results in Tables 2.1 and 2.2 for p = 2. Table 2.1 shows 
that there is not a considerable difference between the experimental results of cyclo- 
tomic fields and those for decomposition fields. In contrast, Table 2.2 shows that 
Kannan’s embedding technique is much faster than Babai’s nearest plane algorithm. 

This implies that the behaviors of the basis reduction algorithms heavily depend 
on the structure of the input lattices. This is a reason why experimental analyses 
are necessary for ensuring the security of lattice-based schemes (or other problems). 
Table 2.2 also shows that the running times for the decomposition fields become
longer than those for cyclotomic fields as $g$ (or $\ell - 1$) increases. Therefore, we can
expect that decomposition fields provide Ring-LWE that is more secure against the
lattice attacks described in Sect. 2.2.4.1 than $\ell$th cyclotomic fields because the ranks
of the lattices occurring in our experiments are very low compared to the ranks of
lattices used in practice. This means that we can use decomposition fields with lower
extension degrees than would be needed for $\ell$th cyclotomic fields, and the use of
such number fields makes Ring-LWE-based schemes more efficient. Therefore, as a
Table 2.1 Experimental results on Babai’s nearest plane algorithm for p = 2

| ℓ                           | 59    | 6183  | 73    | 2089  | 83    | 4051  | 131     | 5419    | 173     | 14449   | 227      | 9719     |
| g                           | -     | 58    | -     | 72    | -     | 81    | -       | 129     | -       | 172     | -        | 226      |
| Lattice rank                | 118   | 116   | 146   | 144   | 166   | 162   | 262     | 258     | 346     | 344     | 454      | 452      |
| r                           | 20    | 20    | 20    | 20    | 20    | 20    | 30      | 30      | 30      | 30      | 30       | 30       |
| Number of samples           | 93    | 100   | 100   | 100   | 100   | 100   | 100     | 100     | 40      | 37      | 15       | 14       |
| Success rate (%)            | 100   | 100   | 100   | 100   | 100   | 100   | 100     | 100     | 100     | 89      | 0        | 0        |
| Average root Hermite factor | 1.014 | 1.014 | 1.014 | 1.014 | 1.014 | 1.014 | 1.020   | 1.020   | 1.020   | 1.020   | 1.089    | 1.021    |
| Average running time (s)    | 72.22 | 88.97 | 218.4 | 238.2 | 443.3 | 456.1 | 12790.5 | 11744.6 | 54763.0 | 57862.3 | 231816.1 | 237846.9 |
| Ratio of running times (%)  | -     | 123.2 | -     | 109.0 | -     | 102.9 | -       | 91.8    | -       | 105.7   | -        | 102.6    |


The columns for which the values g are indicated show the results for decomposition fields; the 
other columns show the results for cyclotomic fields 

The “ratio of running times” is the ratio of the average of running time for a decomposition field to 
that of a cyclotomic field for each g 
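
The ratio row can be recomputed from the average running times as a quick consistency check. The snippet below is only a worked arithmetic example using the figures printed in Table 2.1; small deviations from the reported percentages are expected because the printed running times are themselves rounded.

```python
# Average running times (s) from Table 2.1, as (cyclotomic, decomposition) pairs.
times = [(72.22, 88.97), (218.4, 238.2), (443.3, 456.1),
         (12790.5, 11744.6), (54763.0, 57862.3), (231816.1, 237846.9)]
# "Ratio of running times (%)" as printed in Table 2.1.
reported = [123.2, 109.0, 102.9, 91.8, 105.7, 102.6]

def ratio_percent(cyclotomic, decomposition):
    """Running time for the decomposition field relative to the cyclotomic field, in %."""
    return 100.0 * decomposition / cyclotomic

ratios = [ratio_percent(c, d) for c, d in times]
max_deviation = max(abs(r - rep) for r, rep in zip(ratios, reported))
```

A value above 100 means the attack took longer on the decomposition field than on the cyclotomic field of comparable degree.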


Table 2.2 Experimental results on Kannan’s embedding technique for p = 2

| ℓ                          | 59   | 6183  | 73   | 2089  | 83   | 4051  | 131    | 5419   | 173     | 14449   | 227      | 9719     |
| g                          | -    | 58    | -    | 72    | -    | 81    | -      | 129    | -       | 172     | -        | 226      |
| Lattice rank               | 119  | 117   | 147  | 145   | 167  | 163   | 263    | 259    | 347     | 345     | 455      | 453      |
| r                          | 20   | 20    | 20   | 20    | 20   | 20    | 30     | 30     | 30      | 30      | 40       | 40       |
| Number of samples          | 100  | 100   | 100  | 100   | 100  | 100   | 100    | 100    | 100     | 100     | 23       | 21       |
| Success rate (%)           | 100  | 100   | 100  | 100   | 100  | 100   | 100    | 100    | 100     | 100     | 100      | 100      |
| Average running time (s)   | 10.4 | 10.7  | 36.7 | 41.4  | 92.3 | 97.6  | 4714.6 | 5556.7 | 19387.5 | 25138.7 | 136978.2 | 159772.6 |
| Ratio of running times (%) | -    | 103.5 | -    | 112.7 | -    | 105.7 | -      | 117.9  | -       | 129.7   | -        | 116.6    |


We computed the root Hermite factor for the reduced bases, but we do not show them because the 
success rates in these results are 100% 
[Figure 2.2: plot; horizontal axis: rank of lattice]

Fig. 2.2 Average running times of Kannan’s embedding technique for cyclotomic and decomposition
fields with respect to p = 2, 3, 5, 7, 11. The label “p = 2_cyclotomic” indicates the results of
the cyclotomic fields shown in Table 2.2, and the other labels indicate the results for decomposition
fields with respect to the corresponding prime numbers p. We set the modulus parameter $q = p^r$ so
that these moduli have almost the same bit sizes. We only show the average results on at least 10
samples


result of our analysis, we believe that Ring-LWE over decomposition fields can be 
used to construct more efficient HE schemes. 

We also conducted experiments for decomposition fields with respect to p = 
3,5,7,11 to find decomposition fields that provide weak Ring-LWE instances 
(Fig. 2.2). In these experiments, we could not find decomposition fields that pro- 
vide weak Ring-LWE. 


References 


1. S. Arita, S. Handa, Subring homomorphic encryption, in Proceedings of ICISC 2017. LNCS, 
vol. 10779 (Springer, Cham, 2018), pp. 112-136 

2. L. Babai, On Lovász’ Lattice reduction and the nearest lattice point problem. Combinatorica 
6(1), 1-13 (1986). Springer (Preliminary version in STACS 1985) 

3. G. Bonnoron, C. Fontaine, A note on ring-LWE security in the case of fully homomorphic 
encryption, in Proceedings of INDOCRYPT 2017. LNCS, vol. 10698 (Springer, Cham, 2017), 
pp. 27-43 

4. Z. Brakerski, C. Gentry, V. Vaikuntanathan, (Leveled) fully homomorphic encryption without 
bootstrapping, in Proceedings of ITCS 2012 (ACM New York, NY, USA, 2012), pp. 309-325 

5. Z. Brakerski, V. Vaikuntanathan, Fully homomorphic encryption from ring-LWE and security 
for key dependent messages, in Proceedings of CRYPTO 2011. LNCS, vol. 6841 (Springer, 
Berlin, Heidelberg, 2011), pp. 505-524 

6. H. Chen, K. Lauter, K.E. Stange, Security considerations for Galois non-dual RLWE families, 
in Proceedings of SAC 2016. LNCS, vol. 10532 (Springer, Cham, 2016), pp. 443-462 


7. Y. Chen, P.Q. Nguyen, BKZ 2.0: better lattice security estimates, in Proceedings of ASIACRYPT
2011. LNCS, vol. 7073 (Springer, Berlin, Heidelberg, 2011), pp. 1-20

8. D. Coppersmith, Small solutions to polynomial equations, and low exponent RSA vulnerabilities.
J. Cryptol. 10(4), 233-260 (1997). Springer

9. J. Fan, F. Vercauteren, Somewhat practical fully homomorphic encryption. Cryptology ePrint
Archive, Report 2012/144 (2012)

10. N. Gama, P.Q. Nguyen, Predicting lattice reduction, in Proceedings of EUROCRYPT 2008.
LNCS, vol. 4965 (Springer, Berlin, Heidelberg, 2008), pp. 31-51

11. S. Halevi, V. Shoup, Algorithms in HElib, in Proceedings of CRYPTO 2014. LNCS, vol. 8616
(Springer, Berlin, Heidelberg, 2014), pp. 554-571

12. R. Kannan, Minkowski’s convex body theorem and integer programming. Mathematics of
Operations Research, vol. 12(3), pp. 415-440, INFORMS, Linthicum, Maryland, USA (1987)

13. A.K. Lenstra, H.W. Lenstra Jr., L. Lovász, Factoring polynomials with rational coefficients.
Math. Ann. 261(4), 515-534 (1982). Springer

14. T. Lepoint, M. Naehrig, A comparison of the homomorphic encryption schemes FV and
YASHE, in Proceedings of AFRICACRYPT 2014. LNCS, vol. 8469 (Springer, Cham, 2014),
pp. 318-335

15. V. Lyubashevsky, C. Peikert, O. Regev, On ideal lattices and learning with errors over rings,
in Proceedings of EUROCRYPT 2010. LNCS, vol. 6110 (Springer, Berlin, Heidelberg, 2010),
pp. 1-23

16. V. Lyubashevsky, C. Peikert, O. Regev, On ideal lattices and learning with errors over rings. J.
ACM (JACM) 60(6), 43:1-43:35 (2013), ACM New York, NY, USA

17. V. Lyubashevsky, C. Peikert, O. Regev, A toolkit for ring-LWE cryptography, in Proceedings
of EUROCRYPT 2013. LNCS, vol. 7881 (Springer, Berlin, Heidelberg, 2013), pp. 35-54

18. P. Morandi, Field and Galois theory, Graduate Texts in Mathematics, vol. 167 (Springer-Verlag,
New York, 1996)

19. C.P. Schnorr, M. Euchner, Lattice basis reduction: improved practical algorithms and solving
subset sum problems. Math. Progr. 66(1-3), 181-199 (1994). Springer

20. Y. Wang, Y. Aono, T. Takagi, An experimental study of Kannan’s embedding technique for
the search LWE problem, in Proceedings of ICICS 2017. LNCS, vol. 10631 (Springer, Cham,
2018), pp. 541-553

21. D.V. Bailey, C. Paar, Optimal extension fields for fast arithmetic in public-key algorithms, in
Advances in Cryptology - CRYPTO ’98, 18th Annual International Cryptology Conference,
Santa Barbara, California, USA, August 23-27, 1998, Proceedings (Springer, 1998), pp. 472-485

22. D.J. Bernstein, Curve25519: new Diffie-Hellman speed records, in Public Key Cryptography -
PKC 2006, 9th International Conference on Theory and Practice of Public-Key Cryptography,
New York, NY, USA, April 24-26, 2006, Proceedings (Springer, 2006), pp. 207-228

23. D.J. Bernstein, P. Birkner, M. Joye, T. Lange, C. Peters, Twisted Edwards curves. IACR
Cryptology ePrint Archive 2008, 13 (2008)

24. D.J. Bernstein, T. Lange, Faster addition and doubling on elliptic curves. IACR Cryptology
ePrint Archive 2007, 286 (2007)

25. C. Diem, On the discrete logarithm problem in class groups of curves. Math. Comput. 80(273),
443-475 (2011)

26. J. Faugère, P. Gaudry, L. Huot, G. Renault, Using symmetries in the index calculus for elliptic
curves discrete logarithm. J. Cryptol. 27(4), 595-635 (2014)

27. J. Faugère, L. Perret, C. Petit, G. Renault, Improving the complexity of index calculus
algorithms in elliptic curves over binary fields, in Advances in Cryptology - EUROCRYPT 2012
- 31st Annual International Conference on the Theory and Applications of Cryptographic
Techniques, Cambridge, UK, April 15-19, 2012. Proceedings (Springer, 2012), pp. 27-44

28. S.D. Galbraith, P. Gaudry, Recent progress on the elliptic curve discrete logarithm problem.
Des. Codes Cryptogr. 78(1), 51-72 (2016)

29. S.D. Galbraith, S.W. Gebregiyorgis, Summation polynomial algorithms for elliptic curves
in characteristic two, in Progress in Cryptology - INDOCRYPT 2014 - 15th International
Conference on Cryptology in India, New Delhi, India, December 14-17, 2014, Proceedings
(Springer, 2014), pp. 409-427

30. P. Gaudry, Index calculus for abelian varieties of small dimension and the elliptic curve discrete
logarithm problem. J. Symb. Comput. 44(12), 1690-1702 (2009)

31. Y. Huang, C. Petit, N. Shinohara, T. Takagi, Improvement of Faugère et al.’s Method to Solve
ECDLP, in Advances in Information and Computer Security - 8th International Workshop on
Security, IWSEC 2013, Okinawa, Japan, November 18-20, 2013, Proceedings (Springer, 2013),
pp. 115-132

32. P.L. Montgomery, Speeding the Pollard and elliptic curve methods of factorization. Math.
Comput. 48, 243-264 (1987). URL http://links.jstor.org/sici?sici=0025-5718(198701)48:177<243:STPAEC>2.0.CO;2-3

33. C. Petit, J. Quisquater, On polynomial systems arising from a Weil descent. IACR Cryptology
ePrint Archive 2012, 146 (2012)

34. J.M. Pollard, Monte Carlo methods for index computation mod p. Math. Comput. 32, 918-924
(1978)

35. I.A. Semaev, Summation polynomials and the discrete logarithm problem on elliptic curves.
IACR Cryptology ePrint Archive 2004, 31 (2004)

36. N.P. Smart, The Hessian form of an elliptic curve, in Cryptographic Hardware and Embedded
Systems - CHES 2001, Third International Workshop, Paris, France, May 14-16, 2001,
Proceedings (Springer, 2001), pp. 118-125


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, 
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate 
credit to the original author(s) and the source, provide a link to the Creative Commons license and 
indicate if changes were made. 


The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.


Chapter 3
Secure Primitive for Big Data Utilization


Akinori Kawachi, Atsuko Miyaji, Kazuhisa Nakasho, Yiying Qi, 
and Yuuki Takano 


Abstract In this chapter, we describe two security primitives for big data utilization.
One is privacy-preserving data integration among databases distributed across different
organizations. This primitive integrates the data that the organizations’ databases have
in common while keeping all other data in each organization secret
from the other organizations. The other is privacy-preserving classification. This primitive
applies the server’s classification rule to the client’s input database and
outputs only the result to the client, while keeping the client’s input database secret from
the server and the server’s classification rule secret from the client. These primitives can be
executed not only independently but also jointly. That is, after integrating databases
from distributed organizations by executing the privacy-preserving data integration,
we can execute the privacy-preserving classification.


3.1 Privacy-Preserving Data Integration 


3.1.1 Introduction 


Medical organizations often store the data accumulated through medical analyses. 
However, detailed data analysis sometimes requires separate datasets to be integrated 
without violating patient or commercial privacy. Consider the scenario in which the 


A. Kawachi 
Mie University, 1577 Kurimamachiya-cho, Tsu City, Mie 514-8507, Japan 
e-mail: kawachi@cs.info.mie-u.ac.jp 


A. Miyaji (✉) · Y. Qi · Y. Takano
Osaka University, 1-1 Yamadaoka, Suita, Osaka 565-0871, Japan 
e-mail: miyaji@comm.eng.osaka-u.ac.jp 


Y. Takano 
e-mail: ytakano@cy2sec.comm.eng.osaka-u.ac.jp 


K. Nakasho 
Yamaguchi University, 1677-1 Yoshida, Yamaguchi City, Yamaguchi 753-8511, Japan 
e-mail: nakasho@yamaguchi-u.ac.jp


© The Author(s) 2020
A. Miyaji and T. Mimoto (eds.), Security Infrastructure Technology
for Integrated Utilization of Big Data,
https://doi.org/10.1007/978-981-15-3654-0_3


36 A. Kawachi et al. 


occurrence of similar accidents can be attributed to a particular defective product.
Such defective products should be identified as quickly as possible. However, the
databases related to accidents are maintained separately by different organizations.
Thus, investigating the causes of accidents is often time-consuming. For example,
assume child A has broken a leg at school, but it is not clear whether the
accident was caused by defective equipment. In this case, information relating to A’s
injury, such as the patient’s name and type of injury, is stored in the hospital database
$S_1$. Information pertaining to A’s accident, such as the child’s name and the location of the
swing at the school, is stored in database $S_2$, which is held by the fire department.
Finally, information relating to the insurance claim following A’s accident, such as
the name and medical costs, is maintained in the insurance company’s database, $S_3$.
Computing the intersection of these databases, $S_1 \cap S_2 \cap S_3$, without compromising
privacy would enable us to combine the separate sets of information, which may
allow the cause of the accident to be identified. Let us consider another situation.
Several clinics, denoted as $P_i$, maintain separate databases, represented as $S_i$. The
clinics wish to know the patients they have in common to enable them to share
treatment details; however, $P_i$ should not be able to access any information about
patients not stored in its own dataset. In this case, the intersection of the sets must
not reveal private information.

These examples illustrate the need for the Multiparty Private Set Intersection
(MPSI) protocol [1-4]. MPSI is executed by multiple parties who jointly compute
the intersection of their private datasets. Ultimately, only designated parties can
access this intersection. Previous protocols are impractical because the bulk of the
computation depends on the number of players. One previous study required the
size of the datasets maintained by the different players to be equal [1, 2]. Another
study [3] computed only the approximate number of intersections, whereas other
researchers [4] required more than two trusted third parties.

In this section, we propose a practical MPSI with the following features: 

1. The size of the datasets maintained by each party is independent of those main- 
tained by the other parties. 

2. The computational complexity for each party is independent of the number of 
parties. This is accomplished by introducing an outsourcing provider, O. In fact, all 
computations related to the number of parties are carried out by O. Thus, the number 
of parties is irrelevant. 


3.1.2 Preliminaries 


In this section, we summarize the DDH assumption, Bloom filter, and ElGamal
encryption. We consider security according to the honest-but-curious model [5]: all
players act according to their prescribed actions in the protocol. A protocol that is
secure in the honest-but-curious model does not allow any player to gain information
about other players’ private input sets beyond what can be deduced from the result
of the protocol. Note that the term adversary here refers to insiders, i.e., protocol
participants. Outsider adversaries are not considered because their behavior
can be mitigated via standard network security techniques.
Our protocol is based on the following security assumption.


Definition 3.1 (DDH Assumption) Let $\kappa$ be a security parameter. A decisional Diffie-Hellman
(DDH) parameter generator $\mathcal{IG}$ is a probabilistic polynomial time (PPT)
algorithm that, on input $1^\kappa$, outputs a finite field $\mathbb{F}_p$ and a basepoint $g \in \mathbb{F}_p$ with prime order $q$. We say
that $\mathcal{IG}$ satisfies the DDH assumption if $|p_1 - p_2|$ is negligible (in $\kappa$) for all
PPT algorithms $A$, where
$p_1 = \Pr[(\mathbb{F}_p, g) \leftarrow \mathcal{IG}(1^\kappa);\ y_1 = g^{x_1}, y_2 = g^{x_2} \leftarrow \mathbb{F}_p : A(\mathbb{F}_p, g, y_1, y_2, g^{x_1 x_2}) = 0]$
and
$p_2 = \Pr[(\mathbb{F}_p, g) \leftarrow \mathcal{IG}(1^\kappa);\ y_1 = g^{x_1}, y_2 = g^{x_2}, z \leftarrow \mathbb{F}_p : A(\mathbb{F}_p, g, y_1, y_2, z) = 0]$.


A Bloom filter [6], denoted by BF, is a space-efficient probabilistic data structure
consisting of an array of $m$ bits. The BF can check whether an element $x$ is included in
a set $S$ of at most $w$ elements by encoding $S$ into the array. The encoded Bloom filter of $S$ is
denoted by BF($S$).

The BF uses a set of $k$ independent uniform hash functions $H = \{H_0, \ldots, H_{k-1}\}$,
where $H_i : \{0, 1\}^* \to \{0, 1, \ldots, m - 1\}$ for $0 \le i \le k - 1$. The BF consists of
two functions: Const embeds a given set $S$ into BF($S$) and ElementCheck checks
whether an element $x$ is included in $S$. SetCheck, an extension of ElementCheck,
checks whether each element $x$ in $S'$ is in $S' \cap S$ (see Algorithm 3.3). In Const (see
Algorithm 3.1), BF($S$) is constructed for a given set $S$ by first setting all bits in the
array to 0. To embed an element $x \in S$ into the filter, the element is hashed using the $k$
hash functions to obtain $k$ index numbers, and the bits at these indexes are set to 1,
i.e., set BF[$H_i(x)$] = 1 for $0 \le i \le k - 1$. In ElementCheck (see Algorithm 3.2),
we check all locations where $x$ is hashed; $x$ is considered to be not in $S$ if any bit at
these locations is 0; otherwise, $x$ is probably in $S$.

Some false positive matches may occur, i.e., it is possible that all BF[$H_i(y)$]
are set to 1, but $y$ is not in $S$. The false positive rate FPR is given by
$$\mathrm{FPR} = \left[1 - \left(1 - \frac{1}{m}\right)^{kw}\right]^{k} \approx \left(1 - e^{-kw/m}\right)^{k}$$
[7]. However, false negatives are not possible,
and so Bloom filters have a 100% recall rate.
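
To get a feel for these quantities, the false positive rate can be evaluated numerically. The sketch below simply evaluates the exact expression and its exponential approximation for given $m$, $k$, and $w$; the function names and the sample parameters are illustrative choices, not values from the text.

```python
import math

def fpr_exact(m, k, w):
    """False positive rate [1 - (1 - 1/m)^(k*w)]^k for m bits, k hashes, w elements."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * w)) ** k

def fpr_approx(m, k, w):
    """Approximation (1 - exp(-k*w/m))^k of the same rate."""
    return (1.0 - math.exp(-k * w / m)) ** k

# Illustrative parameters: 1000 bits, 7 hash functions, 100 stored elements.
exact = fpr_exact(1000, 7, 100)
approx = fpr_approx(1000, 7, 100)
```

For fixed $m$ and $k$ the rate grows with the number of stored elements $w$, so $m$ has to be chosen with the maximum set size in mind.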


Algorithm 3.1 Const(S)

Input: A set S
Output: A Bloom filter BF(S)

for i = 0 to m − 1 do
    BF(S)[i] ← 0
end for
for all x ∈ S do
    for i = 0 to k − 1 do
        j = H_i(x)
        if BF(S)[j] = 0 then
            BF(S)[j] ← 1
        end if
    end for
end for
output BF(S). stop.


Algorithm 3.2 ElementCheck(BF, x)

Input: A Bloom filter BF(S), an element x
Output: 1 if x ∈ S and 0 if x ∉ S

for i = 0 to k − 1 do
    j = H_i(x)
    if BF(S)[j] = 0 then
        output 0. stop.
    end if
end for
output 1. stop.


Algorithm 3.3 SetCheck(BF, S′)

Input: A Bloom filter BF(S), a set S′
Output: A set S∩ (= S ∩ S′)

S∩ ← {}
for all x ∈ S′ do
    for i = 0 to k − 1 do
        j = H_i(x)
        if BF[j] = 0 then
            go to next x.
        end if
    end for
    add x to the set S∩
end for
output S∩. stop.
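
The three procedures Const, ElementCheck, and SetCheck translate almost line by line into Python. The following is a minimal sketch rather than the authors' implementation: the $k$ hash functions are emulated by SHA-256 with the index $i$ prepended to the input, which is an assumption made here for illustration, and the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m                     # all m positions start at 0

    def _index(self, i, x):
        """Assumed stand-in for H_i(x): SHA-256 of 'i:x' reduced modulo m."""
        digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
        return int.from_bytes(digest, "big") % self.m

    @classmethod
    def const(cls, S, m, k):
        """Const: embed every element of S by setting its k positions to 1."""
        bf = cls(m, k)
        for x in S:
            for i in range(k):
                bf.bits[bf._index(i, x)] = 1
        return bf

    def element_check(self, x):
        """ElementCheck: 0 if some position of x is 0, otherwise 1 (x is probably in S)."""
        return int(all(self.bits[self._index(i, x)] for i in range(self.k)))

    def set_check(self, S_prime):
        """SetCheck: elements of S' that pass ElementCheck (may include false positives)."""
        return {x for x in S_prime if self.element_check(x)}

bf = BloomFilter.const({"alice", "bob", "carol"}, m=256, k=4)
common = bf.set_check({"alice", "dave"})
```

Every element that was embedded is always reported as present; an element that was not embedded is reported as present only in the rare false positive case discussed above.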
Homomorphic encryption under addition is useful for processing encrypted data.
A typical homomorphic encryption under addition was proposed by Paillier [8].
However, because Paillier encryption cannot reduce the order of a composite group,
it is computationally expensive compared with the following ElGamal encryption.
Our protocol requires matching without revealing the original messages, for which
exponential ElGamal encryption (exElGamal) is sufficient [9]. In fact, the decrypted
results of exElGamal encryption can distinguish whether two messages $m_1$ and $m_2$
are equal, although the exElGamal scheme cannot decrypt messages itself. Furthermore,
exElGamal can be used in $(n, n)$-threshold distributed decryption [10], where
the decryption must be performed by all players acting together. An exElGamal
encryption with $(n, n)$-threshold distributed decryption consists of three functions:

Key generation:
Let $\mathbb{F}_p$ be a finite field and $g \in \mathbb{F}_p$ an element with prime order $q$. Each player $P_i$ chooses $x_i \in \mathbb{Z}_q$
at random and computes $y_i = g^{x_i} \pmod{p}$. Then, $y = \prod_{i=1}^{n} y_i \pmod{p}$ is a public
key and each $x_i$ is a share for each player to decrypt a ciphertext.

Encryption: thrEnc[$m$] → $(u, v)$
Let $m \in \mathbb{Z}_q$ be a message. Choose $r \in \mathbb{Z}_q$ at random, and compute both $u = g^r
\pmod{p}$ and $v = g^m y^r \pmod{p}$ for the input message $m \in \mathbb{Z}_q$ and a public key $y$.
Output $(u, v)$ as a ciphertext of $m$.

Decryption: thrDec[$(u, v)$] → $g^m$
Each player $P_i$ computes $z_i = u^{x_i} \pmod{p}$. All players then compute $z = \prod_{i=1}^{n} z_i
\pmod{p}$ jointly.¹ Finally, each player can decrypt the ciphertext as $g^m = v/z
\pmod{p}$.

ExElGamal encryption with $(n, n)$-threshold decryption has the following features:
(1) homomorphic under addition: $\mathrm{Enc}(m_1)\mathrm{Enc}(m_2) = \mathrm{Enc}(m_1 + m_2)$ for messages
$m_1, m_2 \in \mathbb{Z}_q$.
(2) homomorphic under scalar operations: $\mathrm{Enc}(m)^k = \mathrm{Enc}(km)$ for a message $m$
and $k \in \mathbb{Z}_q$.
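
The three functions and the two homomorphic properties can be exercised with a toy example. The sketch below is illustrative only: the parameters $p = 2039$, $q = 1019$, $g = 4$ are assumptions chosen for readability (far too small for any real use), all players are simulated in one process, and the function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import random

# Toy parameters (assumed): P = 2Q + 1 with P, Q prime; G = 4 has prime order Q mod P.
P, Q, G = 2039, 1019, 4

def keygen(n, rng):
    """Each of the n players picks a share x_i; the public key is y = prod g^{x_i} mod p."""
    shares = [rng.randrange(1, Q) for _ in range(n)]
    y = 1
    for x in shares:
        y = (y * pow(G, x, P)) % P
    return shares, y

def thr_enc(m, y, rng):
    """thrEnc: (u, v) = (g^r, g^m * y^r) mod p."""
    r = rng.randrange(1, Q)
    return pow(G, r, P), (pow(G, m, P) * pow(y, r, P)) % P

def thr_dec(ct, shares):
    """thrDec: every player contributes z_i = u^{x_i}; the result is g^m = v / z mod p."""
    u, v = ct
    z = 1
    for x in shares:
        z = (z * pow(u, x, P)) % P
    return (v * pow(z, -1, P)) % P

def ct_mul(c1, c2):
    """Componentwise product: an encryption of m1 + m2."""
    return (c1[0] * c2[0]) % P, (c1[1] * c2[1]) % P

def ct_pow(c, k):
    """Componentwise power: an encryption of k * m."""
    return pow(c[0], k, P), pow(c[1], k, P)

rng = random.Random(42)
shares, y = keygen(3, rng)
c1, c2 = thr_enc(5, y, rng), thr_enc(7, y, rng)
```

Decryption yields $g^m$ rather than $m$, which is enough to test whether two ciphertexts hide the same message: decrypt both and compare the group elements.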


3.1.3 Previous Work 


This section summarizes prior works on PSI between a server and a client and MPSI
among $n$ players. In PSI, let $S = \{s_1, \ldots, s_v\}$ and $C = \{c_1, \ldots, c_w\}$ be server and
client datasets, respectively, where $|S| = v$ and $|C| = w$. In MPSI [1], we assume
that each player holds a dataset of the same size.

PSI protocol based on polynomial representation: The main idea is to represent 
the elements in C as the roots of a polynomial. The encrypted polynomial is sent 
to the server, where it is evaluated on the elements in S, as originally proposed by 


¹ The computational complexity of $z$ for each player can be made independent of the number of
players in various ways. For example, set $z = 1$. $P_1$ computes $z = z \cdot z_1$ and sends $z$ to $P_2$, $P_2$
computes $z = z \cdot z_2$ and sends $z$ to $P_3$, and, finally, $P_n$ computes $z = z \cdot z_n$ and shares $z$ among all
players. If we place all players in a binary tree, the communication complexity can be reduced, but
each player’s computational complexity is still independent of the number of players.
Freedman [11]. This is secure against honest-but-curious adversaries under secure
public key encryption. The computational complexity is $O(vw)$ exponentiations,
and the communication overhead is $O(v + w)$. The computational complexity can
be reduced to $O(v \log \log w)$ exponentiations using the balanced allocation technique
[12]. Kissner and Song extended this protocol to MPSI [1], which requires $O(nw^2)$
exponentiations and $O(nw)$ communication overhead. The MPSI version is secure
against honest-but-curious and malicious adversaries (in the random oracle model)
using generic zero-knowledge proofs.
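
The core idea of the polynomial representation can be seen without any encryption. The sketch below is a plaintext analogue only: it builds the polynomial whose roots are the client's elements and evaluates it on the server's elements, whereas the actual protocol performs this evaluation on encrypted coefficients. The prime modulus $2^{61} - 1$ and the function names are assumptions made for illustration.

```python
Q = 2**61 - 1  # a Mersenne prime, used here only as a convenient field size (assumed)

def poly_from_roots(roots, q=Q):
    """Coefficients (lowest degree first) of prod_i (y - c_i) mod q."""
    coeffs = [1]
    for c in roots:
        nxt = [0] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i] = (nxt[i] - c * a) % q          # contribution of -c * a_i * y^i
            nxt[i + 1] = (nxt[i + 1] + a) % q      # contribution of a_i * y^(i+1)
        coeffs = nxt
    return coeffs

def poly_eval(coeffs, y, q=Q):
    """Horner evaluation of the polynomial at y modulo q."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * y + a) % q
    return acc

def intersect(C, S):
    """Elements of S at which the client's polynomial vanishes, i.e., elements of C ∩ S."""
    coeffs = poly_from_roots(C)
    return {s for s in S if poly_eval(coeffs, s) == 0}

common = intersect([3, 5, 8], [1, 5, 8, 13])
```

Each of the $v$ server elements requires one evaluation of a degree-$w$ polynomial, which is where the $O(vw)$ cost quoted above comes from.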

PSI protocol based on DH-key agreement: The main idea here is to apply the DH-key agreement protocol [13]. After representing the server and client datasets as hash values $\{h(s_j)\}$ and $\{h(c_i)\}$, respectively, the client encrypts its dataset as $\{h(c_i)^{r_c}\}$ using a random number $r_c$ and sends the encrypted set to the server. The server encrypts the client set $\{h(c_i)^{r_c}\}$ and the server set $\{h(s_j)\}$ using a random number $r_s$, which gives $\{h(c_i)^{r_c r_s}\}$ and $\{h(s_j)^{r_s}\}$, respectively, and returns these sets to the client. Finally, the client evaluates $S \cap C$ by decrypting the first set to $\{h(c_i)^{r_s}\}$ and comparing it with $\{h(s_j)^{r_s}\}$. This is secure against honest-but-curious adversaries under the DDH assumption. The total computational complexity is $O(v + w)$ exponentiations, and the total communication overhead is $O(v + w)$. The security of this approach can be enhanced against malicious adversaries in the random oracle model [14] by using a blind signature. However, no extensions to MPSI based on the DH-key agreement protocol have been proposed.
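The message flow of the DH-based protocol can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the modulus `P` (the Mersenne prime 2^127 - 1), the hash-to-group helper `h`, and all function names are hypothetical choices made here and do not form a vetted DDH group. The server is also assumed to return the re-encrypted client values in the order in which it received them.

```python
import hashlib
import math
import secrets

# Assumption: toy modulus for illustration only, not a vetted DDH group.
P = 2**127 - 1

def h(item: str) -> int:
    """Hypothetical hash into Z_P^* (values in [1, P-1])."""
    digest = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return digest % (P - 1) + 1

def random_exponent() -> int:
    """Random exponent that is invertible modulo the group order P-1."""
    while True:
        r = secrets.randbelow(P - 1)
        if r > 1 and math.gcd(r, P - 1) == 1:
            return r

def dh_psi(client_set, server_set):
    client_items = sorted(client_set)
    r_c = random_exponent()   # client's random number
    r_s = random_exponent()   # server's random number
    # Client -> server: {h(c_i)^{r_c}}
    blinded = [pow(h(c), r_c, P) for c in client_items]
    # Server -> client: {h(c_i)^{r_c r_s}} (same order) and {h(s_j)^{r_s}}
    double_blinded = [pow(t, r_s, P) for t in blinded]
    server_tags = {pow(h(s), r_s, P) for s in server_set}
    # Client removes r_c and compares h(c_i)^{r_s} with the server tags.
    r_c_inv = pow(r_c, -1, P - 1)
    return {c for c, t in zip(client_items, double_blinded)
            if pow(t, r_c_inv, P) in server_tags}
```

Because exponentiation by an exponent coprime to the group order is a bijection, two tags coincide only when the underlying hash values coincide.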

PSI protocol based on BF: This protocol was originally proposed in [4]. As the Bloom filter itself reveals information about the other player's dataset, the set of players is separated into two groups: input players who have datasets and privacy players who perform private computations under shared secret information. In [15], the privacy of each player's dataset is protected by encrypting each array of the Bloom filter using Goldwasser-Micali encryption [16]. In an honest-but-curious version, the computational complexity is $O(kw)$ hash operations and $O(m)$ public key operations, and the communication overhead is $O(m)$, where m and k are the number of arrays and hash functions, respectively, used in the Bloom filter. The Bloom filter is used in the oblivious transfer extension [17, 18] and the newly constructed garbled Bloom filter [19]. The main novelty in the garbled Bloom filter is that each array requires $\lambda$ bits rather than the single bit needed for the conventional Bloom filter. To embed an element $x \in S$ into a garbled Bloom filter, x is split into k shares of $\lambda$ bits using XOR-based secret sharing ($x = x_1 \oplus \cdots \oplus x_k$). Each share $x_i$ is then stored at index $H_i(x)$. An element y is queried by subjecting all bit strings at the indices $H_i(y)$ to an XOR operation. If the result is y, then y is in S; otherwise, y is not in S. The client uses a Bloom filter BF(C), and the server uses a garbled Bloom filter GBF(S). If x is in $C \cap S$, then for every position i it hashes to, BF(C)[i] must be 1 and GBF(S)[i] must be $x_i$. Thus, the client can compute $C \cap S$. The computational complexity of this method is $O(kw)$ hash operations and $O(m)$ public key operations, and the communication overhead is $O(m)$. The number of public key operations can be changed to $O(\lambda)$ using the oblivious transfer extension. This is secure against honest-but-curious adversaries if the oblivious transfer protocol is secure. Finally, some researchers have computed the approximate number of multiparty set unions [3].
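The garbled Bloom filter construction and query described above can be illustrated as follows. This is a hedged sketch, not the construction of [19] verbatim: the parameters `LAMBDA`, `M`, `K`, the SHA-256-based hash family `positions`, and the helper `client_intersection` (which replaces the oblivious transfer by a plain masking step) are assumptions introduced for illustration. Elements are modeled as integers of at most `LAMBDA` bits.

```python
import hashlib
import secrets

LAMBDA = 64   # share length in bits (assumed)
M = 256       # number of arrays (slots), assumed
K = 4         # number of hash functions, assumed

def positions(x: int) -> list:
    """Distinct slot indices H_1(x), ..., H_K(x) from a hypothetical hash family."""
    idx = {int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % M
           for i in range(K)}
    return sorted(idx)

def build_gbf(elements):
    """Embed each x as XOR shares so that the slots of x XOR to x."""
    gbf = [None] * M
    for x in elements:
        free, share = None, x
        for j in positions(x):
            if gbf[j] is None and free is None:
                free = j                      # reserve one empty slot for the last share
            else:
                if gbf[j] is None:
                    gbf[j] = secrets.randbits(LAMBDA)
                share ^= gbf[j]               # fold fixed or fresh shares into the last one
        if free is None:
            raise ValueError("no free slot left; increase M")
        gbf[free] = share
    # Unused slots are filled with random strings.
    return [secrets.randbits(LAMBDA) if s is None else s for s in gbf]

def query(gbf, y: int) -> bool:
    acc = 0
    for j in positions(y):
        acc ^= gbf[j]
    return acc == y

def client_intersection(client_set, gbf_s):
    """Client keeps GBF(S)[i] only where BF(C)[i] = 1 (stand-in for oblivious transfer)."""
    bf_c = [0] * M
    for c in client_set:
        for j in positions(c):
            bf_c[j] = 1
    received = [gbf_s[i] if bf_c[i] else secrets.randbits(LAMBDA) for i in range(M)]
    return {c for c in client_set if query(received, c)}
```

A non-member passes the query only if unrelated random strings happen to XOR to its value, which occurs with probability about 2^-LAMBDA.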


40 A. Kawachi et al. 


3.1.4 Practical MPSI 


This section presents a practical MPSI that is secure under the honest-but-curious 
model. 


3.1.4.1 Notation and Privacy Definition 


In the remainder of this paper, the following notations are used. 


• $P_i$: ith player, $i = 1, \ldots, n$

• O: outsourcing provider with no knowledge of the inputs or outputs

• $S_i = \{s_{i,1}, s_{i,2}, \ldots, s_{i,w_i}\}$: dataset held by $P_i$, where $|S_i| = w_i$

• $\cap S_i$: intersection of the datasets of all n players

• thrEnc and thrDec: (n, n)-threshold exElGamal encryption and decryption, respectively

• m and k: number of arrays and hashes used in BF

• $\boldsymbol{\ell} = [\ell, \ldots, \ell]$ ($1 \le \ell \le n$): an n-dimensional array, where all strings in the array are set to $\ell$

• $BF(S_i) = [BF_i[0], \ldots, BF_i[m-1]]$: Bloom filter applied to a set $S_i$

• $IBF(\cup S_i) = [\sum_{i=1}^{n} BF_i[0], \ldots, \sum_{i=1}^{n} BF_i[m-1]]$: integrated Bloom filter of n sets $\{S_i\}$, where $\sum_{i=1}^{n} BF_i[j]$ is the sum of all players' arrays


We introduce an outsourcing provider O to reduce the computational burden on all players. The provider O has no information regarding the elements of any player's set. The privacy requirements of MPSI with an outsourcing provider can be informally written as follows.


Definition 3.2 (MPSI privacy) An MPSI scheme with an outsourcing provider O 
is player-private if the following two conditions hold: 


• $P_i$ does not learn anything about the elements of other players' datasets except for the elements that $P_i$ originally possesses.

• The outsourcing provider O does not learn anything about the elements of any player's set.


3.1.4.2 Proposed MPSI 


Our MPSI comprises four phases: (i) initialization, (ii) Bloom filter construction and encryption of $P_i$'s data, (iii) randomization of $\mathrm{thrEnc}(IBF(\cup S_i) - n)$ by O, and (iv) computation of $\cap S_i$. The computation of $\cap S_i$ consists of three steps: (a) joint decryption of the (n, n)-threshold exElGamal ciphertexts among the n players, (b) Bloom filter check, and (c) output of the intersection.

Figure 3.1 shows an overview of our protocol after the initialization phase. The system parameters of a finite field $F_p$ and a basepoint $g \in F_p$ with order q for an (n, n)-threshold exElGamal encryption (thrEnc, thrDec) are provided to both $P_i$ and O. For the Bloom filter, Const(S) and SetCheck(BF, S') are only provided to $P_i$, where the array size is m and k independent hash functions are used.

[Figure 3.1 panels: construction and encryption of $BF(S_i) - 1$ by each player; randomization of $\mathrm{Enc}(IBF(\cup S_i) - n)$ by O; computation of $\cap S_i$ by joint decryption; computation of $\cap S_i$ by Bloom filter check (SetCheck) and output]

Fig. 3.1 Overview of our MPSI

To encrypt, randomize, or subtract a vector such as a Bloom filter $BF = [a_0, \ldots, a_{m-1}]$, each location is encrypted, randomized, or subtracted independently:

$$\mathrm{thrEnc}(BF) = [\mathrm{thrEnc}(a_0), \ldots, \mathrm{thrEnc}(a_{m-1})],$$
$$rBF = [r_0 a_0, \ldots, r_{m-1} a_{m-1}], \text{ or}$$
$$BF - r = [a_0 - r_0, \ldots, a_{m-1} - r_{m-1}]$$

for $r = [r_0, \ldots, r_{m-1}] \in \mathbb{Z}_q^m$.
Our protocol proceeds as follows. 


Initialization: 


1. $P_i$ generates $x_i \in \mathbb{Z}_q$, computes $y_i = g^{x_i} \in \mathbb{Z}_p$, and publishes $y_i$ to the other players as a public key, where the corresponding secret key is $x_i$.

2. $P_i$ computes $y = \prod_i y_i$, where y is the n-player public key. Note that no player knows the corresponding secret key $x = \sum_i x_i$ before executing the joint decryption.


Construction and encryption of $BF(S_i) - 1$:


1. $P_i$ executes $\mathrm{Const}(S_i) \to BF(S_i) = [BF_i[0], \ldots, BF_i[m-1]]$ (Algorithm 3.1).
2. $P_i$ encrypts $BF(S_i) - 1$ using $\mathrm{thrEnc}_y$:

$$\mathrm{thrEnc}_y(BF(S_i) - 1) = [\mathrm{thrEnc}_y(BF_i[0] - 1), \ldots, \mathrm{thrEnc}_y(BF_i[m-1] - 1)],$$

where y is the n-player public key.

3. $P_i$ sends $\mathrm{thrEnc}_y(BF(S_i) - 1)$ to O.

Randomization of $\mathrm{thrEnc}(IBF(\cup S_i) - n)$:


1. O encrypts $IBF(\cup S_i) - n$ without knowing $IBF(\cup S_i)$ by using the additive homomorphic feature and multiplying the ciphertexts $\mathrm{thrEnc}_y(BF(S_i) - 1)$ as follows:

$$\mathrm{thrEnc}_y(IBF(\cup S_i) - n) = \prod_{i=1}^{n} \mathrm{thrEnc}_y(BF(S_i) - 1).$$

2. O randomizes $\mathrm{thrEnc}_y(IBF(\cup S_i) - n)$ by $r = [r_0, \ldots, r_{m-1}] \in \mathbb{Z}_q^m$:

$$\mathrm{thrEnc}_y(r(IBF(\cup S_i) - n)) = (\mathrm{thrEnc}_y(IBF(\cup S_i) - n))^{r}.$$

3. O broadcasts $\mathrm{thrEnc}_y(r(IBF(\cup S_i) - n))$ to $P_i$.

Computation of $\cap S_i$:


1. All players decrypt $\mathrm{thrEnc}_y(r(IBF(\cup S_i) - n))$ jointly.
2. $P_i$ computes $\mathrm{SetCheck}(r(IBF(\cup S_i) - n), S_i)$ and obtains $\cap S_i$.


The above protocol satisfies the correctness requirement. This is because each array position of $\mathrm{thrEnc}_y(r(IBF(\cup S_i) - n))$ to which an element $x \in \cap S_i$ is mapped by a hash function is an encryption of 0 and is therefore decrypted to 1; in contrast, for an element $x \notin \cap S_i$, at least one of the array positions to which x is mapped is decrypted to a random value, except in the case of a Bloom filter false positive.
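The four phases can be traced end to end with a small Python simulation. This is a minimal sketch under loud assumptions: the toy group (`P = 2039`, `Q = 1019`, `G = 4`) is far too small to be secure, the SHA-256-based hash family and the sizes `M`, `K` are hypothetical, and all parties run inside one function, so the secret keys are pooled only to simulate the joint decryption.

```python
import hashlib
import secrets

P, Q, G = 2039, 1019, 4   # assumed toy group: G has prime order Q in Z_P^*
M, K = 512, 4             # assumed Bloom filter size and number of hashes

def positions(x):
    return {int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % M
            for i in range(K)}

def const(s):
    """Const(S): build the Bloom filter of a set."""
    bf = [0] * M
    for x in s:
        for j in positions(x):
            bf[j] = 1
    return bf

def enc(y, a):
    """Exponential ElGamal: Enc(a) = (g^r, g^a * y^r)."""
    r = secrets.randbelow(Q)
    return (pow(G, r, P), pow(G, a % Q, P) * pow(y, r, P) % P)

def mpsi(sets):
    n = len(sets)
    # Initialization: y = prod g^{x_i}; no single player knows sum x_i.
    xs = [secrets.randbelow(Q) for _ in range(n)]
    y = 1
    for x in xs:
        y = y * pow(G, x, P) % P
    # Each P_i encrypts BF(S_i) - 1 position by position and sends it to O.
    enc_bfs = [[enc(y, b - 1) for b in const(s)] for s in sets]
    # O multiplies the ciphertexts (adds plaintexts) and randomizes by r_j != 0.
    blinded = []
    for j in range(M):
        c1, c2 = 1, 1
        for e in enc_bfs:
            c1, c2 = c1 * e[j][0] % P, c2 * e[j][1] % P
        r = 1 + secrets.randbelow(Q - 1)
        blinded.append((pow(c1, r, P), pow(c2, r, P)))
    # Joint decryption: every player contributes c1^{x_i}.
    plain = []
    for c1, c2 in blinded:
        d = 1
        for x in xs:
            d = d * pow(c1, x, P) % P
        plain.append(c2 * pow(d, -1, P) % P)
    # SetCheck: x is kept if all of its positions decrypt to 1 (= g^0).
    return [{x for x in s if all(plain[j] == 1 for j in positions(x))} for s in sets]
```

A position decrypts to 1 exactly when the sum of the n filter bits equals n, because the blinding exponent is nonzero modulo the prime order Q.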


3.1.4.3 Security Proof 


The security of our MPSI protocol is as follows. 


Theorem 3.1 For any coalition of fewer than n players, the MPSI is player-private 
against an honest-but-curious adversary under the DDH assumption. 


Proof The views of $P_i$ and O, that is,

$$\mathrm{thrEnc}_y(BF_{m,k}(S_i)) = [\mathrm{thrEnc}_y(BF_i[0]), \ldots, \mathrm{thrEnc}_y(BF_i[m-1])],$$

are shown to be indistinguishable from a random vector $r = [r_0, \ldots, r_{m-1}] \in \mathbb{Z}_q^m$. Assume that a polynomial-time distinguisher D outputs 0 when the views are presented as a random vector and outputs 1 when they are constructed in MPSI, $\mathrm{thrEnc}(BF_i[0]), \ldots, \mathrm{thrEnc}(BF_i[m-1])$. We show that a simulator SIM that solves the DDH problem can be constructed as follows.

Upon receiving a DDH challenge $(g, g^{\alpha}, g^{\beta}, g^{\gamma})$, SIM executes the following:


1. Set the n-player public key $y = g^{\alpha}$ and choose random numbers $a_0, \ldots, a_{m-1}$ and $r_0, \ldots, r_{m-1}$ from $\mathbb{Z}_q$.
2. Send $[(g^{\beta r_0}, g^{a_0} g^{\gamma r_0}), \ldots, (g^{\beta r_{m-1}}, g^{a_{m-1}} g^{\gamma r_{m-1}})]$ as $\mathrm{thrEnc}_y(BF_{m,k}(S_i))$ to D.


If $(g, g^{\alpha}, g^{\beta}, g^{\gamma})$ is a DH tuple, i.e., $\gamma = \alpha\beta$, then $\mathrm{thrEnc}_y(BF_{m,k}(S_i))$ is distributed in the same way as when constructed by the MPSI scheme. Thus, D must output 1. If $(g, g^{\alpha}, g^{\beta}, g^{\gamma})$ is not a DH tuple, then $\mathrm{thrEnc}_y(BF_{m,k}(S_i))$ is randomly distributed, and D has to output 0. Therefore, SIM can use the output of D to respond to the DDH challenge correctly. Under the DDH assumption, this means that D can answer correctly with only negligible advantage over random guessing. Furthermore, as all inputs of each player are encrypted until the decryption is performed, and decryption cannot be performed by fewer than n players, nothing can be learned by any player prior to decryption.

As for the views of $\mathrm{thrEnc}_y(r(IBF_{m,k}(\cup S_i) - n))$, the same argument holds. Therefore, for any coalition of fewer than n players, MPSI is player-private under the honest-but-curious model.


Next, we present d-and-over MPSI. The procedures of d-and-over MPSI are the same as those of MPSI until O computes $\mathrm{thrEnc}_y(IBF(\cup S_i))$. Thus, we describe the procedure after O computes $\mathrm{thrEnc}_y(IBF(\cup S_i))$.


Encryption of $\ell$-subtraction of $IBF(\cup S_i)$: O executes the following:


1. Encrypt $IBF(\cup S_i) - \ell$ randomized by $r = [r_0, \ldots, r_{m-1}] \in \mathbb{Z}_q^m$ ($d \le \ell \le n$):

$$\mathrm{thrEnc}_y(r(IBF(\cup S_i) - \ell)) = (\mathrm{thrEnc}_y(IBF(\cup S_i)) \cdot \mathrm{thrEnc}_y(-\ell))^{r}.$$

2. Broadcast $\{\mathrm{thrEnc}_y(r(IBF(\cup S_i) - \ell))\}_{\ell}$ ($d \le \ell \le n$) to $P_i$.


d-and-over MPSI computation: $P_i$ executes the following:


1. All $P_i$ jointly decrypt $\{\mathrm{thrEnc}_y(r(IBF(\cup S_i) - \ell))\}_{\ell}$.

2. Let $CBF_{\ell}$ be an m-array for $d \le \ell \le n$, where an array is set to 1 if and only if the corresponding array of the decrypted $r(IBF(\cup S_i) - \ell)$ is 1, and the others are set to 0.

3. Set $CBF = CBF_d \vee \cdots \vee CBF_n$.

4. Execute $\mathrm{SetCheck}_{m,k}(CBF, S_i) \to \cap^{\ge d} S[i]$ and output $\cap^{\ge d} S[i]$.


The correctness of d-and-over MPSI follows from the fact that if an element $x \in \cap^{\ell} S$ for $d \le \ell \le n$, the corresponding array locations in $IBF(\cup S_i) - j$ for $\ell \le j \le n$, where x is mapped by k hashes, are an encryption of 0, which is decrypted to 1; otherwise, they are an encryption of a randomized value.
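The counting logic behind d-and-over MPSI can be illustrated on plaintext counters. The sketch below is an assumption-laden stand-in, not the protocol itself: it omits all encryption and randomization, models "decrypts to 1" as "the counter equals $\ell$", and uses a hypothetical SHA-256-based hash family with assumed sizes `M` and `K`.

```python
import hashlib

M, K = 512, 4   # assumed Bloom filter size and number of hashes

def positions(x):
    return {int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest(), "big") % M
            for i in range(K)}

def const(s):
    bf = [0] * M
    for x in s:
        for j in positions(x):
            bf[j] = 1
    return bf

def d_and_over(sets, d):
    n = len(sets)
    # Integrated Bloom filter: position-wise sum of all players' filters.
    ibf = [sum(column) for column in zip(*(const(s) for s in sets))]
    cbf = [0] * M
    for level in range(d, n + 1):
        # CBF_level marks positions where IBF - level is zero
        # (in the protocol: where the randomized ciphertext decrypts to 1).
        cbf_level = [1 if count - level == 0 else 0 for count in ibf]
        cbf = [a | b for a, b in zip(cbf, cbf_level)]
    # SetCheck against the combined filter CBF = CBF_d OR ... OR CBF_n.
    return [{x for x in s if all(cbf[j] for j in positions(x))} for s in sets]
```

Each player learns only which of its own elements are held by at least d players.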


3.1.5 Efficiency 


Although many PSI protocols have been proposed, to the best of our knowledge, relatively few consider the multiparty scenario [1-4]. Our target is multiparty private set intersection, and the final result must be obtained by all players acting together, without a trusted third party (TTP). Among previous MPSI protocols, the approach in [3] computes only the approximate number of intersections, and that in [4] requires more than two TTPs. In contrast, [2] follows almost the same method as [1] and thus has a similar complexity. The only difference lies in the security model. Hence, we only compare our scheme with that of [1].

Table 3.1 Efficiency of [1] and the proposed protocol

                            [1]                        Ours
Computational complexity    O(nw^2)                    P_i: O(w_i), O: O(nw)
Communication overhead      O(nw)                      P_i: O(w + n), O: O(nw)
Restriction on set size     |S_1| = ... = |S_n|        None
Protected values            S_i (∀i ∈ [1, n])          S_i, |S_i| (∀i ∈ [1, n])

The computational and communication efficiency of the proposed protocol and [1] are compared in Table 3.1. These approaches are secure against honest-but-curious adversaries without a TTP under exElGamal encryption (DDH security) and Paillier encryption (Decisional Composite Residuosity (DCR) security), respectively. The Bloom filter parameters (m, k) used in our protocol are set as follows: $k = 80$ and $m = 80w / \ln 2$, where w is the maximum of $|S_i| = w_i$. Then, the probability of false positives is given by $p = 2^{-80}$.
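The relation between these parameters can be checked numerically. The sketch below assumes the standard Bloom filter approximation $p \approx (1 - e^{-kw/m})^k$ for the false-positive probability; the function name `bloom_parameters` and the example set size of 10,000 are illustrative choices.

```python
import math

def bloom_parameters(w: int, k: int = 80):
    """Return (m, p) for sets of at most w elements and k hash functions."""
    m = math.ceil(k * w / math.log(2))        # m = k * w / ln 2, rounded up
    p = (1 - math.exp(-k * w / m)) ** k       # standard false-positive estimate
    return m, p

m, p = bloom_parameters(10_000)
print(m, math.log2(p))
```

With $m = kw / \ln 2$, each array is set with probability 1/2, so the estimate collapses to $2^{-k}$.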

Our MPSI uses the Bloom filter for the computations performed by $P_i$ and the integrations performed by O. The use of a Bloom filter eliminates the restriction on set size. Thus, in our MPSI, the set size of each player is flexible. The computations of $P_i$ consist of Bloom filter construction, joint decryption, and Bloom filter check. Neither the computations related to the Bloom filter nor the joint decryption depends on the number of players, as shown in Sect. 3.1.2. In summary, the computational complexity of operations performed by $P_i$ is $O(w_i)$. All player-dependent data are sent to O, who integrates them as $\prod_{i=1}^{n} \mathrm{thrEnc}_y(BF(S_i) - 1)$ without decryption. Therefore, the computational complexity of operations performed by O is $O(nw)$.


3.1.6 System and Performance 


PSI or MPSI implicitly assumes that every attendee can provide data, any attendee can retrieve data from the shared data, and all attendees can communicate with each other. If PSI or MPSI is implemented straightforwardly, the implementation becomes a system like a peer-to-peer (P2P) network system. Although a fully distributed system like a P2P network has attractive features, such as high availability and scalability, it also has some unfavorable features.

The network address and port translation (NAPT) is a major obstacle for P2P network systems. Modern P2P network systems take advantage of NAPT traversal technologies to overcome NAPT, but these make the architecture complex and costly. The absence of a trusted node is also an obstacle for attendee or group management. Reaching consensus on a P2P network system is difficult or highly costly. Additionally, unpredictable node joining and leaving make P2P network systems complex. To avoid the complexities of P2P networks, we designed a system based on the client server model.

[Figure 3.2: two panels, "P2P Model" and "Our Client Server Model"; labels: Provide Data, Retrieve Data, Dealer]

Fig. 3.2 P2P and client server model

Next, we discuss the design of the client server model for PSI or MPSI. There are 2 main functionalities of PSI or MPSI: (1) data sharing, which shares data among attendees, and (2) data retrieving, which retrieves data from the shared data. Any attendee can retrieve data from the shared data, but the retrieval avoids collecting privacy-sensitive data by using the privacy-preserving techniques described above.

However, we do not assume that every attendee provides and retrieves data. Imagine an incident analysis situation in which data are provided by several organizations that employ labor and operate some machines, and a research institute collects data from the organizations and analyzes them. In such a situation, data providers do not need the data retrieving functionality, and data analysts do not need the data sharing functionality.

Therefore, we define 3 roles for our MPSI application design as follows. 


• Parties: entities that provide data
• Clients: entities that retrieve data
• Dealer: an entity that forwards requests between parties and clients


From the perspective of privilege separation, defining and separating roles is significant. Figure 3.2 shows a P2P network model and our client server model. As shown in this figure, every P2P network node is connected to every other node and can provide and retrieve data, whereas in the client server model parties only provide data and clients only retrieve data. The dealer forwards requests from parties and clients and provides other functionalities that are not specified by PSI or MPSI. For example, attendee or group management, user authentication, and data logging should be performed by the dealer.

Figure 3.3 shows an example sequence diagram of our MPSI application. In this figure, there are 2 parties, 1 client, and 1 dealer. First of all, parties 1 and 2 join the dealer (join p1 and p2). A party must join before providing data, and joining is performed only once at initialization. After that, the client sends a data retrieval request to the dealer (cl req), and the parties send a request to confirm whether the dealer has received data retrieval requests from clients (new-req p1 and p2). Then, the parties and the dealer generate keys, share the keys, encrypt data, and decrypt data (gpk p1 and p2, enc p1 and p2, and dec p1 and p2). Finally, the client gets the result from the dealer.

Fig. 3.3 Sequence diagram of MPSI application

Fig. 3.4 Performance

We measured the performance of our MPSI application, written in Python, on an Amazon EC2 server (2.4 GHz CPU, 1 GB memory). Figure 3.4 shows the results for 2 to 4 parties that provide data containing 10,000 entries. The results show that data retrieval takes approximately 280 s and that the computational cost does not depend on the number of parties.

3.2 Classification


In this section, we present a secure classification protocol, which is a type of secure computation protocol. We assume two participants in the protocol, Alice and Bob. Alice has private data x, and Bob has a classification model C. The task is for Alice to learn C(x) at the end of the protocol while the privacy of x and C is preserved. That is, Alice can learn only C(x) and Bob can learn nothing. Our construction is based on a code-based public-key encryption scheme called HQC [20], which is a candidate in NIST's Post-Quantum Cryptography standardization [21].


3.2.1 Error-Correcting Code 


We start with several fundamental notions for error-correcting codes. 


Definition 3.3 (Linear code) A code C such that $c_1 + c_2 \in C$ holds for any codewords $c_1, c_2 \in C$ is called a linear code. A linear code C of code length n and information bit number k is described as an [n, k] code.


Definition 3.4 (Generator matrix) A matrix $G \in \mathbb{F}^{k \times n}$ that satisfies

$$C = \{m \cdot G \mid m \in \mathbb{F}^{k}\} \tag{3.1}$$

is called a generator matrix. The rows of the generator matrix form a basis of the linear code and generate all codewords.


Definition 3.5 (Parity check matrix) A matrix $H \in \mathbb{F}^{(n-k) \times n}$ that satisfies

$$C = \{x \in \mathbb{F}^{n} \mid H \cdot x^{T} = 0\} \tag{3.2}$$

is called a parity check matrix.


Definition 3.6 (Circulant matrix) For $x = (x_1, \ldots, x_n) \in \mathbb{F}^{n}$, the circulant matrix for x is defined as

$$\mathrm{rot}(x) = \begin{pmatrix} x_1 & x_n & \cdots & x_2 \\ x_2 & x_1 & \cdots & x_3 \\ \vdots & \vdots & \ddots & \vdots \\ x_n & x_{n-1} & \cdots & x_1 \end{pmatrix} \in \mathbb{F}^{n \times n}. \tag{3.3}$$

In addition, the multiplication of two polynomials x, y has the following properties:

$$x \cdot y = x \times \mathrm{rot}(y)^{T} = (\mathrm{rot}(x) \times y^{T})^{T} = y \times \mathrm{rot}(x)^{T} = y \cdot x. \tag{3.4}$$

Definition 3.7 (Cyclic shift) The operation of shifting each component $c_i$ ($i = 0, \ldots, n-2$) of an n-dimensional vector $(c_0, \ldots, c_{n-1})$ to the right by one position and moving $c_{n-1}$ to the beginning of the vector is called a cyclic shift. That is, for any n-dimensional vector $(c_0, \ldots, c_{n-1})$, it is the mapping $\sigma : (c_0, c_1, \ldots, c_{n-1}) \mapsto (c_{n-1}, c_0, \ldots, c_{n-2})$.
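The circulant matrix of Definition 3.6, the product identity (3.4), and the cyclic shift of Definition 3.7 can be checked on small vectors. The sketch below assumes the binary field and identifies a vector with a polynomial in $\mathbb{F}_2[X]/(X^n - 1)$; the function names are illustrative, and indices start at 0 instead of 1.

```python
def rot(x):
    """Circulant matrix of x: entry (i, j) is x[(i - j) mod n]."""
    n = len(x)
    return [[x[(i - j) % n] for j in range(n)] for i in range(n)]

def sigma(c):
    """Cyclic shift: (c_0, ..., c_{n-1}) -> (c_{n-1}, c_0, ..., c_{n-2})."""
    return [c[-1]] + c[:-1]

def poly_mul(x, y):
    """Product in F_2[X]/(X^n - 1): cyclic convolution modulo 2."""
    n = len(x)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] ^= x[i] & y[j]
    return out

def vec_times_rot_transpose(x, y):
    """Row vector x times rot(y)^T, reduced modulo 2."""
    r = rot(y)
    n = len(x)
    return [sum(x[j] * r[i][j] for j in range(n)) % 2 for i in range(n)]

x = [1, 0, 1, 1, 0]
y = [0, 1, 1, 0, 0]
print(poly_mul(x, y) == vec_times_rot_transpose(x, y) == poly_mul(y, x))
```

Column j of `rot(x)` is the j-fold cyclic shift of x, which is why multiplying by a circulant matrix realizes polynomial multiplication modulo $X^n - 1$.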


Definition 3.8 (Quasi-cyclic code) Let $c = (c_0, \ldots, c_{s-1}) \in (\mathbb{F}_2^{n})^{s}$ be an arbitrary codeword of code C and let $\sigma$ be the cyclic shift operation. If $(\sigma(c_0), \ldots, \sigma(c_{s-1})) \in C$, C is called an s-quasi-cyclic code. In particular, when $s = 1$, C is called a cyclic code.


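Definition 3.9 (Systematic quasi-cyclic code) An s-quasi-cyclic [sn, n] code is called a systematic quasi-cyclic code if it has a parity check matrix of the form

$$H = \begin{pmatrix} I_n & 0 & \cdots & 0 & A_0 \\ 0 & I_n & & & A_1 \\ & & \ddots & & \vdots \\ 0 & & \cdots & I_n & A_{s-2} \end{pmatrix}. \tag{3.5}$$

Here, $A_0, \ldots, A_{s-2}$ are $n \times n$ circulant matrices.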


3.2.2 Security Assumptions 


As mentioned above, the security of the public-key cryptosystem HQC is based on the computational difficulty of the quasi-cyclic syndrome decoding problem. More specifically, its security is proved under the following quasi-cyclic syndrome decoding decision assumption.


Definition 3.10 (Quasi-cyclic syndrome decoding assumption) Let n and w be integers and let $s \ge 2$ be the number of blocks. The quasi-cyclic syndrome decoding decision problem of an s-quasi-cyclic code is as follows: given $(H, y^{T})$, where $H \xleftarrow{\$} \mathbb{F}_2^{(sn-n) \times sn}$ is the parity check matrix of a random systematic quasi-cyclic code and $y \in \mathbb{F}_2^{sn-n}$, decide whether $(H, y^{T})$ follows the quasi-cyclic syndrome decoding distribution or the uniform distribution over $\mathbb{F}_2^{(sn-n) \times sn} \times \mathbb{F}_2^{sn-n}$. The assumption states that every efficient algorithm distinguishes the two cases only with negligible advantage.


As will be described later, the security of the secure computation protocol proposed in this section is reduced to the security of HQC. The protocol is therefore secure under the same assumption as HQC.

3.2.3 Security Requirements for 2PC


Secure two-party computation (2PC) is a subproblem of secure multi-party computation. It has been studied by many researchers since it is closely related to many cryptographic protocols. The purpose of 2PC is to construct a general-purpose protocol so that arbitrary functions can be jointly computed without either party revealing its input values to the other. One of the best-known examples of 2PC is Yao's millionaires' problem [22], in which Alice and Bob decide who is richer without revealing their wealth. Specifically, suppose that Alice has a yen, and Bob has b yen. The problem is to decide whether $a > b$ or not while keeping a and b secret from each other. Generally speaking, the security requirement of 2PC is that the computation of any function is performed using a protocol without leaking either input to the other party, and only the computation result is known.

A two-party linear function evaluation is a kind of 2PC that satisfies the 2PC security requirements. In other words, the participants perform the evaluation without revealing their inputs to the other party. In addition, the functionality of the protocol is the evaluation of linear functions. Specifically, the linear function secure computation protocol computes $f(m) = a \cdot m + b$. The participants in the protocol are called Alice and Bob. Alice's input is m, and Bob's input is the linear function parameters a, b. Alice gets only the result $f(m) = a \cdot m + b$ through the protocol, and Bob gets nothing.

Below we define the security requirements for two-party linear function secure 
computation. 


Definition 3.11 (Security against semi-honest adversaries) Let $f = (f_A, f_B)$ be the function that maps the input x of Alice (A) and the input y of Bob (B) to $(f_A(x, y), f_B(x, y))$. A aims to obtain $f_A(x, y)$ and B aims to obtain $f_B(x, y)$.

Let $f = (f_A, f_B)$ be a probabilistic polynomial-time function, and $\pi$ be a two-party protocol for computing the function f. Let the view of A in an execution of $\pi$ on $(x, y)$ with the security parameter n be $\mathrm{view}_A^{\pi}(x, y, n)$ and the view of B be $\mathrm{view}_B^{\pi}(x, y, n)$. The output of A is $\mathrm{output}_A^{\pi}(x, y, n)$ and the output of B is $\mathrm{output}_B^{\pi}(x, y, n)$. In addition, the joint output of the two is denoted as $\mathrm{output}^{\pi}(x, y, n) = (\mathrm{output}_A^{\pi}(x, y, n), \mathrm{output}_B^{\pi}(x, y, n))$.

For semi-honest adversaries, we say that the protocol $\pi$ securely computes the function f if there are probabilistic polynomial-time algorithms $S_A$ and $S_B$ that satisfy the following equations. For any x, y that satisfy $|x| = |y| = n$, $n \in \mathbb{N}$, the following holds:

$$\{(S_A(1^{n}, x, f_A(x, y)), f(x, y))\}_{x,y,n} \equiv_c \{(\mathrm{view}_A^{\pi}(x, y, n), \mathrm{output}^{\pi}(x, y, n))\}_{x,y,n},$$
$$\{(S_B(1^{n}, y, f_B(x, y)), f(x, y))\}_{x,y,n} \equiv_c \{(\mathrm{view}_B^{\pi}(x, y, n), \mathrm{output}^{\pi}(x, y, n))\}_{x,y,n}.$$

3.2.4 HQC Encryption Scheme


The protocols proposed in this section are based on the Hamming Quasi-Cyclic (HQC) cryptosystem of Gaborit et al. [20], which is a public key cryptosystem based on the quasi-cyclic syndrome decoding problem. This cryptosystem uses two kinds of codes: a quasi-cyclic code and an error-correcting code C. The error-correcting code C is an arbitrary linear code (such as a BCH code) with sufficient error correction capability that is used for message encoding and decoding. The quasi-cyclic code serves the security requirement of this public key cryptosystem: it generates noise that an adversary cannot remove.

The participants of the HQC cryptosystem are Alice (A) and Bob (B), and B aims to send the input message m securely to A. Here, $\mathcal{R}$ denotes the ring $\mathbb{F}_2[X]/(X^{n} - 1)$, as in [20]. The cryptosystem is performed as follows:


1. Global parameter settings:
Parameters $\mathrm{param} = (n, k, \delta, w, w_r, w_e)$ and the generator matrix $G \in \mathbb{F}_2^{k \times n}$ of the code C.

2. Key generation:
A generates a random $h \xleftarrow{\$} \mathcal{R}$. Furthermore, $(x, y) \xleftarrow{\$} \mathcal{R}^{2}$ is generated, where the Hamming weight of x and y is w. Secret information $sk = (x, y)$. Public information $pk = (h, s = x + h \cdot y)$. A sends the public information pk to B.

3. Encryption:
B generates a random $e \xleftarrow{\$} \mathcal{R}$ and $(r_1, r_2) \xleftarrow{\$} \mathcal{R}^{2}$. The Hamming weight of e is $w_e$, and the Hamming weight of $r_1$ and $r_2$ is $w_r$. Then, on input m, B computes $u = r_1 + h \cdot r_2$ and $v = m \cdot G + s \cdot r_2 + e$. B sends the ciphertext (u, v) to A.

4. Decryption:
A uses the decoding function $C.\mathrm{Decode}(v - u \cdot y)$ of the error-correcting code C to recover the message m of B.

In the HQC cryptosystem, public information s is added to the message m encoded by the error-correcting code when it is encrypted. Since s is noise with a large Hamming weight generated by the quasi-cyclic code, security is guaranteed by the quasi-cyclic syndrome decoding decision assumption introduced above. In addition, A can use the secret key on the ciphertext in the decryption stage and can remove a large amount of noise from s. However, the noise $x \cdot r_2 - r_1 \cdot y + e$ remains. If the weight of this noise is smaller than the maximum number of correctable errors $\delta$ of the error-correcting code, correct decoding is possible. Hamming weights $w, w_r, w_e = O(\sqrt{n})$ are assumed and analyzed. Moreover, the paper of Gaborit et al. shows that the probability that $w(x \cdot r_2 + e - y \cdot r_1) < \delta$ increases as the code length n becomes larger. In addition, the HQC cryptosystem is IND-CPA secure under the quasi-cyclic syndrome decoding decision assumption.
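The claim that the secret key removes the large noise term can be verified directly. The following worked derivation uses only the quantities defined in the key generation, encryption, and decryption steps ($s = x + h \cdot y$, $u = r_1 + h \cdot r_2$, $v = m \cdot G + s \cdot r_2 + e$) and the commutativity of the ring multiplication:

```latex
\begin{align*}
v - u \cdot y
  &= (m \cdot G + s \cdot r_2 + e) - (r_1 + h \cdot r_2) \cdot y \\
  &= m \cdot G + (x + h \cdot y) \cdot r_2 + e - r_1 \cdot y - h \cdot r_2 \cdot y \\
  &= m \cdot G + x \cdot r_2 - r_1 \cdot y + e .
\end{align*}
```

The terms containing h cancel, so what the decoder sees is the codeword $m \cdot G$ plus a noise term built only from low-weight vectors.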
3.2.5 Proposed Protocol


3.2.5.1 Linear Function Evaluation


We introduce the secure evaluation protocol of linear functions between two parties.

We use two codes, a quasi-cyclic code and an arbitrary error-correcting code C, based on the HQC cryptosystem of Gaborit et al. The participants in the protocol are Alice (A) and Bob (B). A's input is $m \in \mathbb{F}_2$, B's input is $a, b \in \mathbb{F}_2$, B's output is nothing, and A's output is $a \cdot m + b$. The protocol is given in Protocol 3.2.5.1.


Protocol Linear function evaluation protocol


input A: $m \in \mathbb{F}_2$
B: $a, b \in \mathbb{F}_2$

output A: $a \cdot m + b$
B: $\perp$


1. Global parameters $\mathrm{param} = (n, k, \delta, w, w_r, w_e)$ and the generator matrix $G \in \mathbb{F}_2^{k \times n}$ of the code C are chosen.

2. A generates a random $h \xleftarrow{\$} \mathcal{R}$. Furthermore, $(x, y) \xleftarrow{\$} \mathcal{R}^{2}$ is generated, where the Hamming weight of x and y is w. Secret information $sk = (x, y)$. Public information $pk = (h, s = x + h \cdot y)$.

3. By padding the input m with 0, A makes $m = (m, 0, \ldots, 0)$ of dimension k. A generates random $r_A, r_u, r_v \xleftarrow{\$} \mathcal{R}$. Here, the Hamming weight of $r_A, r_u, r_v$ is $w_r$. Then, A computes $(u = h \cdot r_A + r_u,\ v = m \cdot G + s \cdot r_A + r_v)$. A sends the public information h, s and the ciphertext pair u, v to B.

4. B sets $b = (b, 0, \ldots, 0)$ and generates $r_B \xleftarrow{\$} \mathcal{R}$ and $(e_u, e_v) \xleftarrow{\$} \mathcal{R}^{2}$. Here, the Hamming weight of $r_B$ is $w_r$, and the Hamming weight of $e_u$ and $e_v$ is $w_e$. B computes $u' = a \cdot u + h \cdot r_B + e_u$ and $v' = a \cdot v + b \cdot G + s \cdot r_B + e_v$. B sends $u', v'$ back to A.

5. A uses $C.\mathrm{Decode}(v' - u' \cdot y)$ to decode the error-correcting code C, and recovers $a \cdot m + b$ by taking the first bit of the result.


First, we set the global parameters. n is the code length of the code, k is the number of information bits, $\delta$ is the maximum number of correctable errors in the error-correcting code, and $w, w_r, w_e$ are Hamming weights set in advance. For example, they are set to half the weight of $O(\sqrt{n})$ assumed by Gaborit et al. The public parameter G is a generator matrix of the error-correcting code C, which maps messages to codewords as $\mathbb{F}_2^{k} \to \mathbb{F}_2^{n}$.

A generates a random $h \xleftarrow{\$} \mathcal{R}$ and $(x, y) \xleftarrow{\$} \mathcal{R}^{2}$ and computes $s = x + h \cdot y$. Here,

$$s = x + h \cdot y = x + y \cdot \mathrm{rot}(h)^{T} = (x \;\; y)\,(I_n \;\; \mathrm{rot}(h))^{T}. \tag{3.6}$$

In this form, recovering the secret from s can be reduced to the quasi-cyclic syndrome decoding problem. Then, A sets the secret information sk as (x, y) and the public information pk as (h, s).

A pads the input m with 0, making $m = (m, 0, \ldots, 0)$ with dimension k. A generates $r_A, r_u, r_v \xleftarrow{\$} \mathcal{R}$, encodes the value of m with the error-correcting code, and re-randomizes it. A generates the ciphertext pair $(u = h \cdot r_A + r_u,\ v = m \cdot G + s \cdot r_A + r_v)$ and sends it to B. From B's point of view, v contains noise derived from s that cannot be decoded, and B has no secret information with which this noise can be removed, so B cannot learn m.


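B sets $b = (b, 0, \ldots, 0)$ and generates $r_B \xleftarrow{\$} \mathcal{R}$ and $(e_u, e_v) \xleftarrow{\$} \mathcal{R}^{2}$. B produces $u' = a \cdot u + h \cdot r_B + e_u$ and $v' = a \cdot v + b \cdot G + s \cdot r_B + e_v$, which re-randomizes u and v after the update. Since the error-correcting code is a linear code, $u'$ and $v'$ after the update are

$$u' = \begin{cases} h \cdot r_B + e_u & \text{(in the case of } a = 0\text{)} \\ u + h \cdot r_B + e_u & \text{(in the case of } a = 1\text{)}, \end{cases} \tag{3.7}$$

$$v' = \begin{cases} b \cdot G + s \cdot r_B + e_v & \text{(in the case of } a = 0\text{)} \\ v + b \cdot G + s \cdot r_B + e_v & \text{(in the case of } a = 1\text{)}. \end{cases} \tag{3.8}$$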

Finally, A uses her secret information to decrypt $v' - u' \cdot y$. The result is

$$\begin{aligned} v' - u' \cdot y &= (am + b)G + x(a r_A + r_B) - y(a r_u + e_u) + (a r_v + e_v) \\ &= \begin{cases} bG + x r_B - y e_u + e_v & \text{(in the case of } a = 0\text{)} \\ (m + b)G + x(r_A + r_B) - y(r_u + e_u) + (r_v + e_v) & \text{(in the case of } a = 1\text{)}. \end{cases} \end{aligned} \tag{3.9}$$

As shown by Eq. (3.9), $v' - u' \cdot y$ no longer contains h and s. After decoding, taking the first bit makes $a \cdot m + b$ available to A.
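The five protocol steps can be exercised with a toy instantiation. The sketch below is illustrative only and rests on several assumptions that are not in the text: the error-correcting code C is replaced by a length-N repetition code with $k = 1$ (so no padding is needed and majority voting is the decoder), the ring is $\mathbb{F}_2[X]/(X^N - 1)$ with $N = 101$, and all Hamming weights are set to 3. With these choices the noise in Eq. (3.9) has weight at most 42, which is below N/2, so decoding always succeeds. The parameters offer no security.

```python
import secrets

N = 101                # assumed code length; repetition code [N, 1] stands in for C
W = W_R = W_E = 3      # assumed toy Hamming weights

_rng = secrets.SystemRandom()

def sparse(weight):
    """Random vector of the given Hamming weight."""
    v = [0] * N
    for i in _rng.sample(range(N), weight):
        v[i] = 1
    return v

def uniform():
    return [secrets.randbelow(2) for _ in range(N)]

def add(*vs):
    """Addition in F_2^N (identical to subtraction)."""
    return [sum(bits) % 2 for bits in zip(*vs)]

def mul(x, y):
    """Product in F_2[X]/(X^N - 1)."""
    out = [0] * N
    for i, xi in enumerate(x):
        if xi:
            for j, yj in enumerate(y):
                if yj:
                    out[(i + j) % N] ^= 1
    return out

def scale(bit, v):
    return [bit & c for c in v]

def encode(bit):       # m * G for the repetition code
    return [bit] * N

def decode(v):         # majority vote
    return int(sum(v) > N // 2)

def evaluate(m, a, b):
    # Step 2 (A): key generation
    h, x, y = uniform(), sparse(W), sparse(W)
    s = add(x, mul(h, y))
    # Step 3 (A): encrypt m
    r_a, r_u, r_v = sparse(W_R), sparse(W_R), sparse(W_R)
    u = add(mul(h, r_a), r_u)
    v = add(encode(m), mul(s, r_a), r_v)
    # Step 4 (B): apply f(m) = a*m + b and re-randomize
    r_b, e_u, e_v = sparse(W_R), sparse(W_E), sparse(W_E)
    u2 = add(scale(a, u), mul(h, r_b), e_u)
    v2 = add(scale(a, v), encode(b), mul(s, r_b), e_v)
    # Step 5 (A): decode v' - u' * y
    return decode(add(v2, mul(u2, y)))
```

Calling `evaluate(m, a, b)` for all eight combinations of bits returns $a \cdot m + b$ in $\mathbb{F}_2$ each time, since the worst-case noise weight stays below the majority threshold.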

3.2.5.2 Correctness and Security of the Proposed Protocol

The correctness of the two-party linear function evaluation protocol proposed in this study depends on the decoding ability of the code C. Specifically, assuming that $C.\mathrm{Decode}$ decodes $v' - u' \cdot y$ correctly, the following equation is satisfied:

$$\mathrm{Decrypt}(sk, \mathrm{Encrypt}(pk, a \cdot m + b)) = a \cdot m + b. \tag{3.10}$$

Also, let $\epsilon$ be the error of $v' - u' \cdot y$. The error that the error correction capability of the code C must handle is

$$\epsilon = \begin{cases} x r_B - y e_u + e_v & \text{(in the case of } a = 0\text{)} \\ x(r_A + r_B) - y(r_u + e_u) + (r_v + e_v) & \text{(in the case of } a = 1\text{)}. \end{cases} \tag{3.11}$$

In the paper of Gaborit et al., $C.\mathrm{Decode}$ works correctly when $w(x \cdot r_2 + e - y \cdot r_1) < \delta$ is satisfied, and $w_r$ and $w_e$ have the same value when actually evaluated. If the Hamming weight of $r_A, r_u, r_v, r_B$ of the protocol proposed in this section is set to 1/2 of $w_r$ of Gaborit et al., and the Hamming weight of $e_u, e_v$ is set to 1/2 of $w_e$ of Gaborit et al., then the Hamming weight of the error in Eq. (3.11) is less than or equal to the Hamming weight of the errors in the setting of Gaborit et al. Therefore, the conclusion of the paper of Gaborit et al. also holds for the proposed protocol. As the code length n increases, the decoding failure rate of the error-correcting code decreases. If an appropriate code length n and noise Hamming weights $w_r$ and $w_e$ are set, the decoding failure rate approaches 0.

The security requirements of the proposed protocol are described above. In this 
section, we prove the security against semi-honest adversaries. 


Theorem 3.2 Under the quasi-cyclic syndrome decoding assumption, the 2PC pro- 
tocol securely computes linear functions for semi-honest adversaries. 


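Proof First, consider the semi-honest adversary A. With the global parameter omitted, the view of A is $\mathrm{view}_A = (m; h, x, y, r_A, r_u, r_v; u', v')$. We construct a simulator $S_A(m, x, y)$ as follows:


1. Generate $\tilde{h}, \tilde{r}_A, \tilde{r}_u, \tilde{r}_v, \tilde{u}', \tilde{v}' \xleftarrow{\$} \mathcal{R}$ randomly. Here, the Hamming weight of $\tilde{r}_A, \tilde{r}_u, \tilde{r}_v$ is $w_r$.
2. Output $(m, x, y; \tilde{h}, \tilde{r}_A, \tilde{r}_u, \tilde{r}_v; \tilde{u}', \tilde{v}')$.


Since $h, r_A, r_u, r_v$ and $\tilde{h}, \tilde{r}_A, \tilde{r}_u, \tilde{r}_v$ follow the same distribution, the following equation holds:

$$(m, x, y; h, r_A, r_u, r_v; \tilde{u}', \tilde{v}') \equiv_s (m, x, y; \tilde{h}, \tilde{r}_A, \tilde{r}_u, \tilde{r}_v; \tilde{u}', \tilde{v}'). \tag{3.12}$$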


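In $\mathrm{view}_A$, $u' = a \cdot u + h \cdot r_B + e_u$ and $v' = a \cdot v + b \cdot G + s \cdot r_B + e_v$, and it holds that

$$\begin{pmatrix} h \cdot r_B + e_u \\ s \cdot r_B + e_v \end{pmatrix} = \begin{pmatrix} I_n & 0 & \mathrm{rot}(h) \\ 0 & I_n & \mathrm{rot}(s) \end{pmatrix} \begin{pmatrix} e_u \\ e_v \\ r_B \end{pmatrix}. \tag{3.13}$$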

Therefore, a probabilistic polynomial-time adversary cannot distinguish
$(h \cdot r_B + e_u, s \cdot r_B + e_v)$ from uniform random numbers under the 3-quasi-cyclic
syndrome decoding assumption for quasi-cyclic codes. Since $u$ and $v$ are
also covered by the 3-quasi-cyclic syndrome decoding decision assumption, the adversary
cannot distinguish $u$ and $v$ from uniform random numbers. Thus, the distribution
of $u'$ and $v'$ is also indistinguishable from uniform random numbers and satisfies the following
equation:

$$
(m, x, y; h, r_A, r_u, r_v; \tilde{u}', \tilde{v}') \equiv_c (m, x, y; h, r_A, r_u, r_v; u', v'). \tag{3.14}
$$


Thus, the view $view_A$ of A and the output of the simulator $S_A$ are indistinguishable
for polynomial-time adversaries:

$$
S_A(m, x, y) \equiv_c view_A(m, x, y; h, r_A, r_u, r_v; u', v'). \tag{3.15}
$$


Next, consider the semi-honest adversary B. With the global parameter omitted,
the view of B is $view_B = (a, b; h, s, u, v, r_B, e_u, e_v)$. Configure the simulator
$S_B(a, b)$ as follows:


1. Randomly generate $\tilde{h}, \tilde{s}, \tilde{u}, \tilde{v}, \tilde{r}_B, \tilde{e}_u, \tilde{e}_v \xleftarrow{\$} R$. Here, the Hamming weight of
   $\tilde{r}_B$ is $w_r$, and the Hamming weight of $\tilde{e}_u$ and $\tilde{e}_v$ is $w_e$.
2. Output $(a, b; \tilde{h}, \tilde{s}, \tilde{u}, \tilde{v}, \tilde{r}_B, \tilde{e}_u, \tilde{e}_v)$.


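Since $h, r_B, e_u, e_v$ and $\tilde{h}, \tilde{r}_B, \tilde{e}_u, \tilde{e}_v$ follow the same distribution, the following
equation holds:

$$
(a, b; h, \tilde{s}, \tilde{u}, \tilde{v}, r_B, e_u, e_v) \equiv_s (a, b; \tilde{h}, \tilde{s}, \tilde{u}, \tilde{v}, \tilde{r}_B, \tilde{e}_u, \tilde{e}_v). \tag{3.16}
$$

Note that $s$ can be reduced to the 2-quasi-cyclic syndrome decoding decision assumption, and its
distribution cannot be distinguished from uniform random numbers by a polynomial-time
adversary. Therefore, the following equation is satisfied:

$$
(a, b; h, s, \tilde{u}, \tilde{v}, r_B, e_u, e_v) \equiv_c (a, b; h, \tilde{s}, \tilde{u}, \tilde{v}, r_B, e_u, e_v). \tag{3.17}
$$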


Moreover, as with $(h \cdot r_B + e_u, s \cdot r_B + e_v)$, a probabilistic polynomial-time
adversary cannot distinguish $u$ and $v$ from uniform random numbers under the
quasi-cyclic syndrome decoding assumption, so the following holds:

$$
(a, b; h, s, u, v, r_B, e_u, e_v) \equiv_c (a, b; h, s, \tilde{u}, \tilde{v}, r_B, e_u, e_v). \tag{3.18}
$$


Therefore, the view $view_B$ of B and the output of the simulator $S_B$ cannot
be distinguished by a polynomial-time adversary:

$$
S_B(a, b) \equiv_c view_B(a, b; h, s, u, v, r_B, e_u, e_v). \tag{3.19}
$$




The above protocol works over $\mathbb{F}_2$, but one can see that it can be easily extended
to a larger field $\mathbb{F}_q$ by using appropriate error-correcting linear codes over $\mathbb{F}_q$.


3.2.5.3 Secure Comparison 


The two-party secure comparison protocol proposed in this section is based on the size
comparison method used in the secure decision tree classification protocol of Wu et
al. [23]. In this section, we use the criterion given in Proposition 3.1 for
comparison.


Proposition 3.1 For $t$-bit $x, y$, if there is an $i \in [t]$ such that the following expression
holds, then $x < y$.

$$
x_i - y_i + 1 + 3 \sum_{j<i} (x_j \oplus y_j) = 0.
$$
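
The criterion of Proposition 3.1 can be checked on plaintext bits. The following is a minimal sketch, assuming the bits are indexed from the most significant bit and the arithmetic is done over the integers (over $\mathbb{F}_q$ this corresponds to choosing $q > 3t - 1$ so that the expression cannot wrap around to 0). The helper names `to_bits` and `less_than_by_criterion` are illustrative and do not appear in the original protocol.

```python
def to_bits(value, t):
    """t-bit binary expansion of value, most significant bit first."""
    return [(value >> (t - 1 - i)) & 1 for i in range(t)]


def less_than_by_criterion(x_bits, y_bits):
    """Return True if some index i satisfies
    x_i - y_i + 1 + 3 * sum_{j<i} (x_j XOR y_j) == 0."""
    if len(x_bits) != len(y_bits):
        raise ValueError("both inputs must have the same bit length")
    prefix_diff = 0  # running value of sum_{j<i} (x_j XOR y_j)
    for x_i, y_i in zip(x_bits, y_bits):
        if x_i - y_i + 1 + 3 * prefix_diff == 0:
            return True
        prefix_diff += x_i ^ y_i
    return False


if __name__ == "__main__":
    t = 4
    # The criterion agrees with the ordinary comparison on all 4-bit inputs.
    agree = all(
        less_than_by_criterion(to_bits(x, t), to_bits(y, t)) == (x < y)
        for x in range(2 ** t)
        for y in range(2 ** t)
    )
    print("criterion matches x < y:", agree)
```

The expression is 0 exactly at the first position where the two inputs differ, and only when that bit of x is 0 and that bit of y is 1. When no index yields 0, the inputs satisfy x ≥ y, which includes the case x = y.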


In this section, we introduce the proposed two-party secret comparison protocol.
The protocol uses a quasi-cyclic code and an arbitrary error-correcting code (for example,
a Reed-Solomon code) over $\mathbb{F}_q$. The participants in the protocol are Alice (A) and Bob (B).
The input of A is $c \in \mathbb{N}$, and the input of B is $d \in \mathbb{N}$. The output of A is the result of
the comparison between c and d, and B receives no output.

The flow of the two-party secret comparison is as follows:


Protocol Two-party secret comparison protocol


Input   A: $c \in \mathbb{N}$
        B: $d \in \mathbb{N}$
Output  A: Comparison result of c and d
        B: $\perp$

1. A and B perform binary expansion of their inputs c and d so that
   $c = c_1 c_2 \ldots c_l$ and $d = d_1 d_2 \ldots d_l$. Then, each bit $c_i, d_i$, $i \in [l]$, is padded to
   k bits. In addition, they set the global parameter $param = (n, k, \delta, w_x, w_r)$ and
   the generator matrix $G \in \mathbb{F}_q^{k \times n}$ of code C.

2. A generates a random $h \xleftarrow{\$} R$. Furthermore, $(x, y) \xleftarrow{\$} R^2$ with Hamming weight
   $w_x$ is generated. The private key is $sk = (x, y)$, and the public key is $pk = (h, s = x + h \cdot y)$.

3. A generates random $r_{Ai}, r_{ui}, r_{vi} \xleftarrow{\$} R$, $i \in [l]$, with Hamming weight $w_r$. Then,
   A computes $u_i = h \cdot r_{Ai} + r_{ui}$ and $v_i = c_i \cdot G + s \cdot r_{Ai} + r_{vi}$ for l pairs and
   sends the l pairs of ciphertexts $(u_i, v_i)$ to B.

4. B generates $(r_{Bi}, e_{ui}, e_{vi}) \xleftarrow{\$} R^3$ with Hamming weight $w^*$ and computes
   the expression $c_i - d_i + 1 + 3 \sum_{w<i} (c_w \oplus d_w)$ for $c_i$. Specifically, B substitutes
   plaintext $d_i$ for $i \in [l]$ in the above formula and sets appropriate $a_{1i}, a_{2i}, \ldots, a_{li}$,
   $b_i$. B computes $u_i' = a_{1i} \cdot u_1 + \cdots + a_{li} \cdot u_l + h \cdot r_{Bi} + e_{ui}$ and
   $v_i' = a_{1i} \cdot v_1 + \cdots + a_{li} \cdot v_l + b_i \cdot G + s \cdot r_{Bi} + e_{vi}$ for l pairs. Then, the l pairs
   $(u_i', v_i')$ are randomly permuted and sent to A.

5. A computes $v_i' - u_i' \cdot y$ for each $i \in [l]$ and decrypts the result. If there is 0 in
   the first bit of the decoded results, c < d is output. Conversely, if there is no 0,
   c > d is output.


Protocol Description 


1. In step 1, A and B expand their inputs c and d to l-bit binary form, so that
   $c = c_1 c_2 \ldots c_l$ and $d = d_1 d_2 \ldots d_l$, where $c_i, d_i$, $i \in [l]$, is the ith digit of c, d,
   and l is the bit length. To encode, each $c_i, d_i$, $i \in [l]$, is padded to bit length k.
   In addition, the global parameters are set: n is the code length, k is the number of
   information bits, $\delta$ is the maximum number of errors that can be corrected by the
   error-correcting code, and $w_x$ and $w_r$ are the Hamming weights set in advance.
   The public parameter G is the generator matrix (for example, the Reed-Solomon
   code generator matrix) of the error-correcting code C, which maps messages
   to codewords as $\mathbb{F}_q^k \to \mathbb{F}_q^n$.

2. In step 2, A generates a private key and a public key for the HQC encryption scheme.

3. In step 3, A uses the public key to encrypt each $c_i$ and sends the resulting
   ciphertexts $(u_i, v_i)$, $i \in [l]$, to B.

4. Step 4 uses Proposition 3.1 for the evaluation of $c_i - d_i + 1 + 3 \sum_{w<i} (c_w \oplus d_w)$.
   In other words, c < d if there exists $i \in [l]$ such that

   $$
   c_i - d_i + 1 + 3 \sum_{w<i} (c_w \oplus d_w) = 0. \tag{3.20}
   $$

   In particular, since B has plaintext $d_i$ and encrypted $c_i$, Eq. (3.20) can be regarded
   as an expression with $c_i$ as an unknown and can be computed. In addition, for XOR
   operations, B can transform $x_i \oplus y_i$ into

   $$
   x_i \oplus y_i = \begin{cases} x_i & (y_i = 0) \\ 1 - x_i & (y_i = 1). \end{cases} \tag{3.21}
   $$

   Therefore, the XOR operation requires only the additive homomorphism of the HQC
   encryption scheme.
   That is, B substitutes plaintext $d_i$, $i \in [l]$, into the above equation, sets the
   appropriate $a_{1i}, a_{2i}, \ldots, a_{li}, b_i$, and computes as follows:

   $$
   u_i' = a_{1i} \cdot u_1 + \cdots + a_{li} \cdot u_l + h \cdot r_{Bi} + e_{ui}. \tag{3.22}
   $$

   $$
   v_i' = a_{1i} \cdot v_1 + \cdots + a_{li} \cdot v_l + b_i \cdot G + s \cdot r_{Bi} + e_{vi}. \tag{3.23}
   $$

   Here, the Hamming weight of $r_{Bi}, e_{ui}, e_{vi}$, $i \in [l]$, is $w^*$.
   Furthermore, so as not to leak to A which bits are different, B
   needs to randomly permute the order of the computed pairs $(u_i', v_i')$.

5. In step 5, A computes $v_i' - u_i' \cdot y$, $i \in [l]$. The result is

   $$
   \begin{aligned}
   v_i' - u_i' \cdot y
   &= (a_{1i} \cdot c_1 + \cdots + a_{li} \cdot c_l + b_i) \cdot G \\
   &\quad + x \cdot (a_{1i} \cdot r_{A1} + \cdots + a_{li} \cdot r_{Al} + r_{Bi}) \\
   &\quad - y \cdot (a_{1i} \cdot r_{u1} + \cdots + a_{li} \cdot r_{ul} + e_{ui}) \\
   &\quad + (a_{1i} \cdot r_{v1} + \cdots + a_{li} \cdot r_{vl} + e_{vi}).
   \end{aligned} \tag{3.24}
   $$

   Then, the evaluation result is decoded by the error-correcting code. A takes
   the first bit of each of the l decoding results and outputs c < d if any of them is 0. If
   there is no 0, c > d is output.
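
Step 4 relies on the fact that, once B fixes its plaintext bits, the expression of Eq. (3.20) is an affine function of A's bits. The sketch below derives the coefficients on plaintext values only, with no encryption involved. It assumes 0-based indices and integer arithmetic (in the protocol the coefficients would live in $\mathbb{F}_q$, so −3 corresponds to q − 3), and the function names are illustrative rather than part of the original protocol.

```python
def affine_coefficients(d_bits, i):
    """Coefficients (a, b) such that, for every bit vector c,
    sum_j a[j] * c[j] + b == c_i - d_i + 1 + 3 * sum_{w<i} (c_w XOR d_w),
    given B's plaintext bits d_bits (0-based index i)."""
    a = [0] * len(d_bits)
    a[i] = 1               # the c_i term
    b = 1 - d_bits[i]      # the constant -d_i + 1
    for w in range(i):
        if d_bits[w] == 0:
            a[w] += 3      # c_w XOR 0 = c_w
        else:
            a[w] -= 3      # c_w XOR 1 = 1 - c_w
            b += 3
    return a, b


def evaluate_direct(c_bits, d_bits, i):
    """Left-hand side of Eq. (3.20), computed directly on plaintext bits."""
    return c_bits[i] - d_bits[i] + 1 + 3 * sum(
        c_bits[w] ^ d_bits[w] for w in range(i))


def evaluate_affine(c_bits, d_bits, i):
    """Same value, computed only through the affine form that B would
    apply homomorphically to the ciphertexts of c."""
    a, b = affine_coefficients(d_bits, i)
    return sum(a_j * c_j for a_j, c_j in zip(a, c_bits)) + b


if __name__ == "__main__":
    c_bits, d_bits = [0, 1, 0], [0, 1, 1]   # c = 2, d = 3
    values = [evaluate_affine(c_bits, d_bits, i) for i in range(3)]
    print(values, "-> c < d" if 0 in values else "-> no zero found")
```

In this sketch the number of nonzero coefficients for index i is i + 1, which is the quantity that |L| counts in the correctness analysis.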


3.2.5.4 Correctness and Security of the Proposed Protocol 


Correctness 

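First, we explain $w^*$ in step 4. The Hamming weight of the polynomial coefficient
vectors $x, y$ is $w_x$, and the Hamming weight of $r_{Ai}, r_{ui}, r_{vi}$, $i \in [l]$, is $w_r$. Since each
is selected uniformly and independently, the probability of each bit value of the vector
is expressed as follows:

$$
x_j, y_j = \begin{cases} 0 & \text{w.p. } 1 - p_x \\ 1 & \text{w.p. } p_x = \frac{w_x}{n}. \end{cases} \tag{3.25}
$$

Similarly,

$$
r_{Ai,j}, r_{ui,j}, r_{vi,j} = \begin{cases} 0 & \text{w.p. } 1 - p_r \\ 1 & \text{w.p. } p_r = \frac{w_r}{n}. \end{cases} \tag{3.26}
$$

Let L be the set of nonzero coefficients among $a_{1i}, a_{2i}, \ldots, a_{li}$ in each
$a_{1i} \cdot r_{A1} + a_{2i} \cdot r_{A2} + \cdots + a_{li} \cdot r_{Al}$ for $i \in [l]$:

$$
L = \{ a_{ji} \mid a_{ji} \neq 0 \}.
$$

Let |L| be the number of elements in set L. Set the Hamming weight $w^*$ for
$r_{Bi}, e_{ui}, e_{vi}$ as follows:

$$
w^* = (n - |L| + 1) w_r.
$$

Thus, the value of each $w^*$ can be determined from the number of nonzero
coefficients $a_{ji}$ for each $i \in [l]$.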
Next, we analyze the validity of the proposed protocol.

The validity of the proposed two-party secure computation of linear functions
clearly depends on the decoding capability of C. Let $e$ be the error of $v_i' - u_i' \cdot y$. With
respect to the error correction capability of code C, the error is

$$
\begin{aligned}
e &= x \cdot (a_{1i} \cdot r_{A1} + \cdots + a_{li} \cdot r_{Al} + r_{Bi}) \\
&\quad - y \cdot (a_{1i} \cdot r_{u1} + \cdots + a_{li} \cdot r_{ul} + e_{ui}) \\
&\quad + (a_{1i} \cdot r_{v1} + \cdots + a_{li} \cdot r_{vl} + e_{vi}).
\end{aligned} \tag{3.27}
$$

In other words, if $\omega(e) \le \delta$, decoding is successful. Here, $\delta$ is the maximum number
of errors that can be corrected by the error-correcting code C. In addition, in order to
analyze the validity of the proposed protocol, we generalize the validity analysis of the HQC
encryption scheme proved by Gaborit et al. [20].

The following proposition holds for the Hamming weight of the error.

The following proposition holds for the Hamming weight of the error. 


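Proposition 3.2 Let $x = (X_1, \ldots, X_n)$ and $r = (R_1, \ldots, R_n)$ be polynomial
coefficient vectors, and let $y = x \cdot r = (Y_1, \ldots, Y_n)$. The probability that the sum of
the random variables $Y_i$, $i \in [n]$, over $\mathbb{F}_q$ is 0 is

$$
\Pr[Y_1 + \cdots + Y_n = 0] = \frac{1}{q} \left\{ 1 + \left( 1 - \frac{q}{q-1} p \right)^n (q - 1) \right\}, \tag{3.28}
$$

where the probability distribution of the random variable $Y_i$ is

$$
Y_i = \begin{cases}
0 & \text{w.p. } p_0 = 1 - p \\
1 & \text{w.p. } p_1 = \frac{p}{q-1} \\
\vdots & \\
q - 1 & \text{w.p. } p_{q-1} = \frac{p}{q-1}.
\end{cases} \tag{3.29}
$$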
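Proof For $Y_i$, the following equation holds:

$$
\Pr[Y_1 + \cdots + Y_n = 0]
= \sum_{\substack{i_0 + i_1 + \cdots + i_{q-1} = n \\ i_0 \cdot 0 + i_1 \cdot 1 + \cdots + i_{q-1} \cdot (q-1) = 0}}
\frac{n!}{i_0! \cdots i_{q-1}!} \, p_0^{i_0} \cdots p_{q-1}^{i_{q-1}}, \tag{3.30}
$$

where $i_0, \ldots, i_{q-1}$ is the number of times the corresponding value $0, \ldots, q - 1$ appears.
From the multinomial theorem, the following equation holds:

$$
\begin{aligned}
&\{p_0 + p_1 + \cdots + p_{q-1}\}^n + \{p_0 + \omega_q p_1 + \cdots + \omega_q^{q-1} p_{q-1}\}^n \\
&\quad + \cdots + \{p_0 + \omega_q^{q-1} p_1 + \cdots + (\omega_q^{q-1})^{q-1} p_{q-1}\}^n \\
&= \sum_{i_0 + \cdots + i_{q-1} = n} \frac{n!}{i_0! \cdots i_{q-1}!} \, p_0^{i_0} \cdots p_{q-1}^{i_{q-1}}
\Big\{ 1 + (\omega_q)^{i_0 \cdot 0 + i_1 \cdot 1 + \cdots + i_{q-1} \cdot (q-1)} \\
&\qquad + (\omega_q^2)^{i_0 \cdot 0 + i_1 \cdot 1 + \cdots + i_{q-1} \cdot (q-1)}
+ \cdots + (\omega_q^{q-1})^{i_0 \cdot 0 + i_1 \cdot 1 + \cdots + i_{q-1} \cdot (q-1)} \Big\},
\end{aligned} \tag{3.31}
$$

where $\omega_q$ is a primitive $q$th root of 1 and has the following property:

$$
1 + \omega_q + \omega_q^2 + \cdots + \omega_q^{q-1} = 0. \tag{3.32}
$$

By Eq. (3.32), the sum in braces on the right-hand side of Eq. (3.31) vanishes unless
$i_0 \cdot 0 + i_1 \cdot 1 + \cdots + i_{q-1} \cdot (q - 1) = 0$, in which case it equals $q$. Substituting this
condition into Eq. (3.31), it can be transformed as follows:

$$
\begin{aligned}
&\{p_0 + p_1 + \cdots + p_{q-1}\}^n + \{p_0 + \omega_q p_1 + \cdots + \omega_q^{q-1} p_{q-1}\}^n \\
&\quad + \cdots + \{p_0 + \omega_q^{q-1} p_1 + \cdots + (\omega_q^{q-1})^{q-1} p_{q-1}\}^n \\
&= q \sum_{\substack{i_0 + \cdots + i_{q-1} = n \\ i_0 \cdot 0 + \cdots + i_{q-1} \cdot (q-1) = 0}}
\frac{n!}{i_0! \cdots i_{q-1}!} \, p_0^{i_0} \cdots p_{q-1}^{i_{q-1}}.
\end{aligned} \tag{3.33}
$$

Substituting Eq. (3.33) into Eq. (3.30), the proposition holds:

$$
\begin{aligned}
\Pr[Y_1 + \cdots + Y_n = 0]
&= \frac{1}{q} \Big\{ (p_0 + p_1 + \cdots + p_{q-1})^n + (p_0 + \omega_q p_1 + \cdots + \omega_q^{q-1} p_{q-1})^n \\
&\qquad + \cdots + (p_0 + \omega_q^{q-1} p_1 + \cdots + (\omega_q^{q-1})^{q-1} p_{q-1})^n \Big\} \\
&= \frac{1}{q} \left\{ 1 + \left( 1 - p + \frac{p}{q-1} \left( \omega_q + \omega_q^2 + \cdots + \omega_q^{q-1} \right) \right)^n (q - 1) \right\} \\
&= \frac{1}{q} \left\{ 1 + \left( 1 - \frac{q}{q-1} p \right)^n (q - 1) \right\}.
\end{aligned} \tag{3.34}
$$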
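
The closed form of Proposition 3.2 can be cross-checked numerically. The sketch below is an illustration under stated assumptions, not part of the original analysis: it treats the $Y_i$ as independent with the distribution of Eq. (3.29), performs the addition modulo q (which matches $\mathbb{F}_q$ when q is prime), and uses exact rational arithmetic so that the two values can be compared for equality. The function names are hypothetical.

```python
from fractions import Fraction


def prob_sum_zero_exact(n, q, p):
    """Pr[Y_1 + ... + Y_n = 0 mod q], obtained by convolving the
    single-variable distribution n times (dynamic programming)."""
    step = [1 - p] + [p / (q - 1)] * (q - 1)   # p_0, p_1, ..., p_{q-1}
    dist = [Fraction(1)] + [Fraction(0)] * (q - 1)
    for _ in range(n):
        new_dist = [Fraction(0)] * q
        for s, prob_s in enumerate(dist):
            for y, prob_y in enumerate(step):
                new_dist[(s + y) % q] += prob_s * prob_y
        dist = new_dist
    return dist[0]


def prob_sum_zero_closed(n, q, p):
    """Closed form of Eq. (3.28)."""
    return (1 + (q - 1) * (1 - Fraction(q, q - 1) * p) ** n) / q


if __name__ == "__main__":
    n, q, p = 6, 5, Fraction(1, 10)
    print(prob_sum_zero_exact(n, q, p) == prob_sum_zero_closed(n, q, p))
```

For q = 2 the formula reduces to the familiar probability that an even number of n independent bits are set, $(1 + (1 - 2p)^n)/2$.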
In addition, the remaining analysis is the same as the validity analysis in Gaborit
et al. [20]. According to the analysis result of [20], in the case of $\mathbb{F}_2$, the decoding
failure rate can be controlled by setting an appropriate code length n and noise
Hamming weights $w_x$ and $w_r$. Therefore, in the case of $\mathbb{F}_q$, it can be expected that
the decoding failure rate can be controlled by setting the appropriate parameters.


Security 

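This section describes the security of the proposed secret comparison protocol.
First, consider the semi-honest adversary A and $output_A = (c < d)$. Omitting the global
parameters, A's view is

$$
view_A = (c, x, y; h, \{r_{Ai}\}_{i=1}^{l}, \{r_{ui}\}_{i=1}^{l}, \{r_{vi}\}_{i=1}^{l}, \{u_i'\}_{i=1}^{l}, \{v_i'\}_{i=1}^{l}).
$$

Here, the first bit is 0 only for $v_{i_x}' - u_{i_x}' \cdot y$ with index $i_x$. The simulator
$S_A(c, x, y)$ is configured as follows:


1. Generates $\tilde{h}, \{\tilde{r}_{Ai}\}_{i=1}^{l}, \{\tilde{r}_{ui}\}_{i=1}^{l}, \{\tilde{r}_{vi}\}_{i=1}^{l}, \{\tilde{u}_i'\}_{i=1}^{l}, \{\tilde{v}_i'\}_{i=1}^{l} \xleftarrow{\$} R$ at random.
   Here, the Hamming weight of $\{\tilde{r}_{Ai}\}_{i=1}^{l}, \{\tilde{r}_{ui}\}_{i=1}^{l}, \{\tilde{r}_{vi}\}_{i=1}^{l}$ is $w_r$. It also selects
   an index $i_x \in [l]$ such that the first bit of $\tilde{v}_{i_x}' - \tilde{u}_{i_x}' \cdot y$ is 0 and the first bit of the other
   $\{\tilde{v}_i' - \tilde{u}_i' \cdot y\}_{i \neq i_x}$ is non-zero.

2. Randomly permutes $\{\tilde{u}_i'\}_{i=1}^{l}, \{\tilde{v}_i'\}_{i=1}^{l}$ so that the pairs appear in random
   order.

3. Outputs $(c, x, y; \tilde{h}, \{\tilde{r}_{Ai}\}_{i=1}^{l}, \{\tilde{r}_{ui}\}_{i=1}^{l}, \{\tilde{r}_{vi}\}_{i=1}^{l}, \{\tilde{u}_i'\}_{i=1}^{l}, \{\tilde{v}_i'\}_{i=1}^{l})$.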

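Since $h, \{r_{Ai}\}_{i=1}^{l}, \{r_{ui}\}_{i=1}^{l}, \{r_{vi}\}_{i=1}^{l}$ and $\tilde{h}, \{\tilde{r}_{Ai}\}_{i=1}^{l}, \{\tilde{r}_{ui}\}_{i=1}^{l}, \{\tilde{r}_{vi}\}_{i=1}^{l}$ follow the
same distribution, the following equation holds:

$$
(h, \{r_{Ai}\}_{i=1}^{l}, \{r_{ui}\}_{i=1}^{l}, \{r_{vi}\}_{i=1}^{l}) \equiv_s (\tilde{h}, \{\tilde{r}_{Ai}\}_{i=1}^{l}, \{\tilde{r}_{ui}\}_{i=1}^{l}, \{\tilde{r}_{vi}\}_{i=1}^{l}). \tag{3.35}
$$

From the quasi-cyclic syndrome decoding assumption for quasi-cyclic codes, a
probabilistic polynomial-time adversary cannot distinguish $u_j', v_j'$, $j \in [l]$,
from uniformly random ones. Furthermore, since $\{u_i'\}_{i=1}^{l}$ and $\{v_i'\}_{i=1}^{l}$ are permuted
randomly, the index $i_x$ of the $v_{i_x}' - u_{i_x}' \cdot y$ whose first bit is 0 is
uniformly random, and the following is satisfied:

$$
(\{\tilde{u}_i'\}_{i=1}^{l}, \{\tilde{v}_i'\}_{i=1}^{l}) \equiv_c (\{u_i'\}_{i=1}^{l}, \{v_i'\}_{i=1}^{l}). \tag{3.36}
$$

Therefore, the distribution of the view $view_A$ and that of the simulator $S_A$ when
$output_A = (c < d)$ are indistinguishable for polynomial-time adversaries.

The case of the semi-honest adversary A with $output_A = (c > d)$ is the same as the security
proof in the case of $output_A = (c < d)$, so details are omitted.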

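Next, we consider the semi-honest adversary B. Omitting the global parameters,
B's view is $view_B = (d; h, s, \{u_i\}_{i=1}^{l}, \{v_i\}_{i=1}^{l}, \{r_{Bi}\}_{i=1}^{l}, \{e_{ui}\}_{i=1}^{l}, \{e_{vi}\}_{i=1}^{l})$.
Configure the simulator $S_B(d)$ as follows:


1. Generates $\tilde{h}, \tilde{s}, \{\tilde{u}_i\}_{i=1}^{l}, \{\tilde{v}_i\}_{i=1}^{l}, \{\tilde{r}_{Bi}\}_{i=1}^{l}, \{\tilde{e}_{ui}\}_{i=1}^{l}, \{\tilde{e}_{vi}\}_{i=1}^{l} \xleftarrow{\$} R$ at random.
   Here, the Hamming weight of $\{\tilde{r}_{Bi}\}_{i=1}^{l}, \{\tilde{e}_{ui}\}_{i=1}^{l}, \{\tilde{e}_{vi}\}_{i=1}^{l}$ is $w^*$.
2. Outputs $(d; \tilde{h}, \tilde{s}, \{\tilde{u}_i\}_{i=1}^{l}, \{\tilde{v}_i\}_{i=1}^{l}, \{\tilde{r}_{Bi}\}_{i=1}^{l}, \{\tilde{e}_{ui}\}_{i=1}^{l}, \{\tilde{e}_{vi}\}_{i=1}^{l})$.


Since $h, \{r_{Bi}\}_{i=1}^{l}, \{e_{ui}\}_{i=1}^{l}, \{e_{vi}\}_{i=1}^{l}$ and $\tilde{h}, \{\tilde{r}_{Bi}\}_{i=1}^{l}, \{\tilde{e}_{ui}\}_{i=1}^{l}, \{\tilde{e}_{vi}\}_{i=1}^{l}$ follow the
same distribution, the following equation holds:

$$
(h, \{r_{Bi}\}_{i=1}^{l}, \{e_{ui}\}_{i=1}^{l}, \{e_{vi}\}_{i=1}^{l}) \equiv_s (\tilde{h}, \{\tilde{r}_{Bi}\}_{i=1}^{l}, \{\tilde{e}_{ui}\}_{i=1}^{l}, \{\tilde{e}_{vi}\}_{i=1}^{l}). \tag{3.37}
$$

$s$ can be reduced to the 2-quasi-cyclic syndrome decoding decision assumption, and
its distribution is indistinguishable from uniform random numbers for probabilistic
polynomial-time adversaries. Thus, $\tilde{s} \equiv_c s$ holds.

In addition, since $u_i, v_i$, $i \in [l]$, are based on the quasi-cyclic syndrome
decoding assumption, a probabilistic polynomial-time adversary cannot distinguish
$u_i, v_i$, $i \in [l]$, from uniform random numbers:

$$
(\{\tilde{u}_i\}_{i=1}^{l}, \{\tilde{v}_i\}_{i=1}^{l}) \equiv_c (\{u_i\}_{i=1}^{l}, \{v_i\}_{i=1}^{l}). \tag{3.38}
$$

Therefore, the distribution of B's view $view_B$ and that of the simulator $S_B$ are indistinguishable
for polynomial-time adversaries.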


3.2.6 Support Vector Machine from Secure Linear Function 
Evaluation and Secure Comparison 


We can construct a code-based protocol for a support vector machine from the protocols
for the evaluation of linear functions and comparison described above. Note that the
result of the secure evaluation of a linear function is in $\mathbb{F}_q$, while the secure comparison
operates on a bit string. Therefore, we need a secure bit-decomposition protocol.
Bit-decomposition protocols have already been studied well in the research area of
secure computation, and indeed, we can use the bit-decomposition protocol given
in [24] with the secure computation protocol from threshold homomorphic encryption
[25]. (It is straightforward to construct a threshold version of the HQC scheme by setting
$sk_A = (x_1, y_1)$ and $sk_B = (x_2, y_2)$ as distributed decryption keys for A and B. Then,
the encryption key is $(h, (x_1 + x_2) + h \cdot (y_1 + y_2))$.)

We describe an overview of the protocol below. For simplicity, we denote by
$[m]$ the ciphertext of $m$ under the HQC encryption scheme over $\mathbb{F}_q$.


Protocol


Input   A: $m \in \mathbb{F}_q$
        B: $a, b, t \in \mathbb{F}_q$

Output  A: whether $a \cdot m + b > t$ or not
        B: $\perp$

1. A and B perform the secure linear evaluation protocol over $\mathbb{F}_q$. Then, B sends A
   $[a \cdot m + b]$ at step 4 in the original protocol.

2. A and B start the secure bit-decomposition protocol on $[a \cdot m + b]$.

3. From the result of the bit-decomposition protocol, B obtains the binary
   representation $[(a \cdot m + b)_1], \ldots, [(a \cdot m + b)_l]$.

4. A and B perform the secure comparison protocol from step 4.


Chapter 4
Secure Data Management Technology


Tomoaki Mimoto, Shinsaku Kiyomoto, and Atsuko Miyaji 


Abstract In this chapter, we introduce data anonymization techniques for several 
types of datasets. Data anonymity of anonymized datasets is an index for estimat- 
ing the (maximum) reidentification risk from anonymized datasets and is generally 
defined as a quantitative index based on adversary models. The adversary mod- 
els are implicitly defined according to the attributes in the datasets, use cases, and 
anonymization techniques. We first review existing anonymization techniques and 
the adversary models behind the data anonymity definitions for anonymization tech- 
niques; then, we propose a common anonymity definition and its adversary model, 
which is applicable to several types of anonymization techniques. Furthermore, some 
extensions of the definition, which is optimized for specific types of datasets, are pre- 
sented in the chapter. 


4.1 Introduction 


Secure data management is a key issue in personal data distribution and analysis. 
Anonymization techniques have been used to harmonize the utility of data and their 
privacy risks. These techniques transform personal data into anonymized data to 
reduce the success probability of reidentification of data principals from the data. If 
the data are well anonymized, they cannot be connected to a person; thus, the privacy 
of the person is protected by anonymization techniques. 

Secure computation is sometimes not a realistic solution for commercial services
due to its cost for data of very large size. Some anonymization techniques, in contrast, work
on commercial services as a “practical” solution, even when the size of the data
is very large. Thus, anonymization techniques have been applied for personal data
distribution and data analysis. For example, k-anonymization was first proposed as a
practical solution to reduce the reidentification risks of public data; since then, it has
also been considered for the secure management of personal data.

Quantitative measures for anonymity are required for estimating privacy risks and 
assessing the feasibility of privacy requirements. In several studies on anonymization, 
privacy notions providing quantitative measures for anonymity have been defined for 
each anonymization technique; however, no common notion for all anonymization 
techniques has been presented to date, which means that each privacy notion is not 
universal but is localized, and heuristic approaches are still used to harmonize the 
usability of data and privacy risks through whole processes or services. A common 
notion is required for consistent secure data management for the whole process. 

In this chapter,¹ we discuss a new common privacy notion based on an adversary
model, which is applicable to several anonymization techniques, and introduce
a novel anonymization technique and its implementation. In
Sect. 4.2, we revisit adversary models for several anonymization techniques and
review the techniques themselves. A common adversary model and quantitative
measures based on it are proposed in Sect. 4.3. An extension
is discussed in Sect. 4.4. Our implementation of an anonymization tool is introduced
in Sect. 4.5. We conclude this chapter in Sect. 4.6.


4.2 Anonymization Techniques and Adversary Models, 
Revisited 


The related work presented below is grouped under k-anonymization and noise
addition as anonymization methods.


4.2.1 k-Anonymization


k-anonymity [4-6] is a well-known privacy model. The property of k-anonymity is
that each published record is such that every combination of values of quasi-identifiers 
can be matched to at least k respondents. 


¹ This chapter is reprinted from [1-3].




4.2.1.1 Adversary Model 


k-anonymized datasets are assumed to be in public domains. An adversary can obtain 
all the attribute values in a dataset and execute arbitrary operations on the attribute 
values. 

There are few formal definitions or models for the adversary that aim to identify 
the attributes of a certain individual in a k-anonymized dataset. Kiyomoto and Martin 
modeled an adversary [7] for k-anonymized datasets based on two query functions 
as follows: 

Let $d$ be the index of the $d$th record, $q_x$ be a set of $m$ attribute values in $T^*$, and
$s$ be a value of the sensitive attribute. The two query functions are defined as:


• read. For the input of an index value $d$, the function outputs the $d$th record. That
  is, $f(T^*, query = \{read, d\}) \to \{d, q^d, s^d\}$, where $q^d$ and $s^d$ are the values of the
  quasi-identifier and the sensitive attribute in the $d$th record, respectively. If the $d$th
  record does not exist, then the function outputs failed.

• search. For input $q_x$ and/or $s$, the function outputs the number $u$ of records
  and the index values of the records that have quasi-identifier $q_x$ and/or sensitive attribute $s$. That
  is, $f(T^*, query = \{search, q_x, s\}) \to u, D$, where $u$ and $D$ are the number of
  records and a sequence of index values that have the same quasi-identifier and/or
  sensitive attribute, respectively. If $s$ or $q_x$ does not exist, then the function outputs
  failed.
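
The two query functions can be sketched as follows. This is only an illustration of the adversary's interface: the table layout (a list of `(quasi_identifier, sensitive_value)` pairs), the 0-based indices, the sample table `T_STAR`, and the use of the string `"failed"` as a return value are all assumptions made for this example.

```python
# Hypothetical k-anonymized table T*: (quasi-identifier tuple, sensitive value).
T_STAR = [
    (("30-39", "13*"), "flu"),
    (("30-39", "13*"), "cold"),
    (("40-49", "14*"), "flu"),
]


def read(table, d):
    """read query: return (d, q_d, s_d) for the d-th record, or "failed"
    if no such record exists."""
    if 0 <= d < len(table):
        q_d, s_d = table[d]
        return (d, q_d, s_d)
    return "failed"


def search(table, q_x=None, s=None):
    """search query: return (u, D), the number and the indices of records
    whose quasi-identifier equals q_x and/or whose sensitive attribute
    equals s; return "failed" if nothing matches."""
    D = [
        idx
        for idx, (q, sensitive) in enumerate(table)
        if (q_x is None or q == q_x) and (s is None or sensitive == s)
    ]
    if not D:
        return "failed"
    return (len(D), D)


if __name__ == "__main__":
    print(read(T_STAR, 1))
    print(search(T_STAR, q_x=("30-39", "13*")))
    print(search(T_STAR, s="flu"))
```

An adversary in this model can only combine such queries, so the number u returned by a search on a quasi-identifier is a natural quantity to compare against k.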


4.2.1.2 k-Anonymization Algorithm 


The idea of k-anonymity is easy to understand, and many types of k-anonymization algorithms
have been proposed. The Incognito algorithm [8] generalizes the attributes using
taxonomy trees, and the Mondrian algorithm [9] averages or replaces the original data
with representative values to achieve k-anonymization. In this paper, we use a
k-anonymization algorithm based on clustering and denote by $A_k(D)$ the k-anonymization
of dataset D. The algorithm finds close records and creates clusters such that each
cluster contains at least k records. For details of the algorithm, see [10].
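
Whatever algorithm is used, its output can be checked against the k-anonymity property itself. The sketch below is not the clustering algorithm of [10]; it only verifies that every combination of quasi-identifier values occurs at least k times. The record layout (tuples) and the function names are assumptions for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, qi_indices):
    """Count how many records share each combination of quasi-identifier
    values; qi_indices selects the quasi-identifier columns."""
    return Counter(tuple(rec[i] for i in qi_indices) for rec in records)


def is_k_anonymous(records, qi_indices, k):
    """True if every quasi-identifier combination appears at least k times."""
    sizes = equivalence_class_sizes(records, qi_indices)
    return all(size >= k for size in sizes.values())


if __name__ == "__main__":
    # Hypothetical generalized records: (age range, ZIP prefix, diagnosis).
    records = [
        ("30-39", "13*", "flu"),
        ("30-39", "13*", "cold"),
        ("40-49", "14*", "flu"),
        ("40-49", "14*", "flu"),
    ]
    print(is_k_anonymous(records, qi_indices=(0, 1), k=2))
    print(is_k_anonymous(records, qi_indices=(0, 1), k=3))
```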


4.2.2 Noise Addition 


Noise addition works by adding or multiplying stochastic or randomized numbers to
confidential data [11]. The idea is simple and is also well known as an anonymization
technique.


4.2.2.1 Adversary Model


One objective of an adversary against noise-added datasets is to remove the noise 
or estimate the original values from the noise-added attribute values. One potential 
scenario is a probabilistic approach in which an adversary estimates the distribution 
of noise and chooses an attribute value with high probability. There is no formal 
adversary model on static noise-added datasets, but Differential Privacy settings 
assume data include dynamically added noise, and their adversary simulations are 
defined as query-based. 


4.2.2.2 Anonymization Algorithm by Noise Addition 


The first work on noise addition was proposed by Kim [12], and the idea was to
add noise $\epsilon$ with a distribution $\epsilon \sim N(0, \sigma^2)$ to the original data. Additive noise is
uncorrelated noise and preserves the mean and covariance of the original data, but the 
correlation coefficients and variance are not retained. Another variation of additive 
noise is correlated additive noise, which keeps the mean and allows the correlation 
coefficients in the original data to be retained [13]. Differential privacy is a state-of- 
the-art privacy model that is based on the statistical distance between two database 
tables differing by at most one record. The basic idea is that, regardless of background 
knowledge, an adversary with access to the dataset draws the same conclusions, 
irrespective of whether a person’s data are included in the dataset. Differential privacy 
is mainly studied in relation to perturbation methods in an interactive setting, although 
it is applicable to certain generalization methods. 

In this paper, we use Laplace noise as a noise addition and add noise $\epsilon$ ~
Lap(0, 27) to each attribute. We denote Ag(D) as noise addition for dataset D.
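
Adding Laplace noise to every attribute can be sketched with the standard library alone. The sketch assumes numeric attributes and a zero-mean Laplace distribution with a free scale parameter `scale` (variance 2 · scale²); this parameter, like the function names, is an assumption of the example and is not tied to the chapter's own parameterization. Sampling uses the fact that the difference of two independent Exp(1) variables follows a standard Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """One sample from a zero-mean Laplace distribution with the given
    scale, drawn as the scaled difference of two Exp(1) samples."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def add_laplace_noise(dataset, scale, seed=None):
    """Return a copy of dataset (a list of numeric records) in which
    independent Laplace noise has been added to every attribute."""
    rng = random.Random(seed)
    return [
        [value + laplace_noise(scale, rng) for value in record]
        for record in dataset
    ]


if __name__ == "__main__":
    data = [[35.0, 52000.0], [41.0, 61000.0]]
    print(add_laplace_noise(data, scale=1.0, seed=42))
```

In practice the scale would be chosen per attribute, since attributes with very different ranges are perturbed very differently by the same amount of noise.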


4.2.3 K-Anonymization for Combined Datasets 


We introduce an adversary model for a combined dataset from datasets produced by 
two service providers and anonymization methods [14]. 


4.2.3.1 Adversary Model 


If we consider the existing adversary model and assume that the anonymization 
tables produced by the service providers satisfy k-anonymity, the combined table 
also satisfies k-anonymity. However, we have to consider another type of adversary 
in our new service model. In our service model, the combined table includes many 
sensitive attributes; thus, the adversary can distinguish a data owner using background 
knowledge of combinations of sensitive attribute values of the data owner. If the 
adversary finds a combination of known sensitive attributes on only one record, the 
adversary can obtain information; the record is a data owner that the adversary knows,
and the adversary also knows the remaining sensitive attributes of the data owner. 
We model the above type of new adversary as follows: 

π-knowledge Adversary Model. An adversary knows certain π sensitive attributes
$\{s^i_1, \ldots, s^i_j, \ldots, s^i_\pi\}$ of a victim $i$. Thus, the adversary can distinguish the victim in
an anonymization table in which only one record has any combination (at most a
π-tuple) of the attributes $\{s^i_1, \ldots, s^i_j, \ldots, s^i_\pi\}$.
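
The adversary model above can be turned into a simple check. The sketch below assumes that each record is a dictionary of sensitive attributes and that the adversary's background knowledge is a dictionary of known attribute values; it reports whether some combination of the known values matches exactly one record. The data layout and the function names are assumptions for illustration.

```python
from itertools import combinations


def matching_records(table, known_subset):
    """Indices of the records that agree with every (attribute, value)
    pair in known_subset."""
    return [
        idx
        for idx, record in enumerate(table)
        if all(record.get(attr) == value for attr, value in known_subset)
    ]


def victim_is_distinguishable(table, known):
    """True if some non-empty combination of the known sensitive
    attribute values is found on exactly one record of the table."""
    items = sorted(known.items())
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            if len(matching_records(table, subset)) == 1:
                return True
    return False


if __name__ == "__main__":
    table = [
        {"disease": "flu", "income": "low"},
        {"disease": "flu", "income": "high"},
        {"disease": "cold", "income": "low"},
        {"disease": "cold", "income": "low"},
    ]
    print(victim_is_distinguishable(table, {"disease": "flu", "income": "low"}))
    print(victim_is_distinguishable(table, {"disease": "cold", "income": "low"}))
```

In the example, neither known attribute alone isolates a record, but the pair ("flu", "low") does, which is the situation the combined table has to avoid.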


4.2.3.2 Modification of Quasi-identifiers 


The first strategy is to modify the quasi-identifiers of the combined table. The data 
user generates a merged table from two anonymization tables as follows: First, the 
data user simply merges the records in the two tables as |q%|s"" ,|s‘\|5%|. Then, the 
data user modifies qe to satisfy the following condition, where 0 is the total number 
of sensitive attributes in the merged table. 


4.2.3.3 Modification of Sensitive Attributes 


The second approach is to modify the sensitive attributes in the combined table for 
the condition. If a subtable ist B is’ DA that consists of sensitive attributes is required 
to satisfy k-anonymity, some sensitive attribute values are removed from the table 
and are changed to x to satisfy k-anonymity. Note that we do not accept that all 
sensitive attributes are x due to having no information record. 


4.2.3.4 Algorithm for Modification 


One algorithm that finds a kK-anonymized combined dataset is executed as follows: 


1. The algorithm generalizes quasi-identifiers to satisfy the condition that each group 
of the same quasi-identifiers has at least m x k records. 

2. The algorithm generates all the tuples of sensitive attributes in the table. 

3. For each tuple, the algorithm finds all the records that have the same sensi- 
tive attributes as the tuple or has « for sensitive attributes and makes them a 
group. We define the number of sensitive attributes in the group which is 0. The 
algorithm generates a partial table that consists of 0 — x sensitive attributes and 
checks whether the partial table has at least k different combinations of sensitive 
attributes. 

4. If the partial table does not satisfy the above condition, the algorithm chooses a 
record from other groups that have different tuples of x sensitive attributes and 
changes the zr sensitive attributes to x. The algorithm executes this step until the 
partial table has up to x different combinations of sensitive attributes. 
5. The algorithm executes step 3 and step 4 for all the tuples of x sensitive attributes
in the table. 


4.2.4 Matrix Factorization for Time-Sequence Data 


Some studies have used matrices for time-sequence datasets. Zheng et al. [15, 16] 
proposed predicting a user’s interests in an unvisited location. They assumed users’ 
GPS trajectory as a user-location matrix where each value of the matrix indicates the 
number of visits of a user to a location. The matrix is very sparse because each user 
visits only a handful of locations, so a collaborative filtering model is applied to the 
prediction. Zheng et al. [17] built a location-activity matrix, M, which has missing 
values. M is decomposed into the two low-rank matrices U and V. The missing 
values can be filled by $X = UV^{T} \approx M$, and locations can be recommended when
some activities are given. Chawla et al. [18] constructed a graph from the trajectories 
of taxis and transformed the graph into matrices. The authors of [19] proposed a 
method of identifying traffic flows that cause an anomaly between two regions. 


4.2.5 Anonymization Techniques for User History Graphs 


In this subsection, we introduce two anonymization techniques for user history 
graphs, which are proposed in [1]. 


4.2.5.1 Adversary Model 


Privacy leakage from a merged history graph is the disclosure of the actions of a 
particular person from the graph. Attacks against user history graphs are intended 
to obtain the private information of a particular user from the graph. We assume 
that the merging process is executed on a trusted domain and that only the merged 
history graph is published; thus, the adversary can only obtain the merged graph. 
Furthermore, we assume that the adversary has the following knowledge about the 
user: The history of the user is included in the merged graph and the user performs 
an action t. The adversary tries to discover other actions of the user to be able to 
guess which edges connecting to node t can be assigned to the user.
We summarize the adversary model as follows: 

Adversary against a Merged History Graph. It is assumed that an adversary knows 
that a victim A executed an action t. The objective of the adversary is to obtain the 
actions that A executed before or after the action t. Thus, the adversary searches the 
merged history graph, which includes actions of other people and finds the actions 
of A using the knowledge that action t was executed. 
We define privacy notions to use with the above adversary model in a later
subsection.


4.2.5.2 Notions for the Untraceability of a Graph 


We consider two levels of privacy notions: partial k-untraceability and complete k- 
untraceability. Partial k-untraceability accepts the leakage of some partial actions of 
a user but prevents all the actions of the user from being revealed. The definition 
of complete k-untraceability involves meeting the requirement that no action of the 
user is leaked. The symbol $Act^A_{x \to y}$ for user A denotes the sequence of all the actions
of user A from action x to action y. For example, the sequence of actions from the
first action to action x and the sequence of actions from action x to the final action
are denoted as $Act^A_{\to x}$ and $Act^A_{x \to}$, respectively.


Definition 4.1 (Partial k-untraceability) We assume that an adversary knows an
action t of a user A, and we consider all the possible adversaries defined for any action
t of the user in the merged graph. If at least k sequences of actions are potentially
associated with user A and k − 1 other users exist as candidates for all actions
$Act^A_{\to t}$ and $Act^A_{t \to}$, the digraph satisfies k-untraceability for A. If the digraph
satisfies the above condition for all users, then the digraph is said to satisfy partial
k-untraceability.


Definition 4.2 (Complete k-untraceability) We assume that an adversary knows an
action t of a user A, and we consider all the possible adversaries defined for any action
t of the user in the merged graph. If at least k actions are potentially associated with
user A and k − 1 other users exist as candidates for each action in $Act^A_{\to t}$ and
$Act^A_{t \to}$, the digraph satisfies k-untraceability for A. If the digraph satisfies the
above condition for all users, the digraph satisfies complete k-untraceability.


Generally, many trivial actions are performed by many users. It is not important 
for privacy purposes where we keep the information about such actions. Thus, we 
relax the above definitions to produce an anonymized graph that includes much of 
the information needed to analyze a user’s history. Let v be the threshold value for 
the number of performing users that establishes that an action is trivial; that is, we 
judge the actions x → y to be trivial if the label L(x → y) > v. Both definitions are
modified as follows: 


Definition 4.3 (Partial (k, v)-untraceability) We assume that an adversary knows
an action t of a user A, and we consider all the possible adversaries defined for any
t in the merged graph. If at least k sequences of actions are potentially associated
with user A and k − 1 other users exist as candidates for all actions $Act^A_{\to t}$ and
$Act^A_{t \to}$, except trivial actions x → y that have a label L(x → y) > v, then the
digraph satisfies partial (k, v)-untraceability for A. If the digraph satisfies the above
condition for all users, then the digraph satisfies partial (k, v)-untraceability.
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Definition 4.4 (Complete (k, v)-untraceability) We assume that an adversary knows 
an action ¢ of a user A, and we consider all the possible adversaries defined for any t in 
the merged graph. If at least k actions are pony associated with user A andk — 1 
other users exist as candidates for each action in Act4 Naan Nd Acti N, aa CXC€PE trivial 
actions x — y that have a label L(x —> y) > v, then the digraph satisfies complete 
(k, v)-untraceability for A. If the digraph satisfies the above condition for all users, 
then the digraph satisfies complete (k, v)-untraceability. 


In a complete (k, v)-untraceable graph, each action t except trivial actions has k 
outgoing edges and incoming edges; thus, an action of user A that connects to action t 
cannot be identified from k candidates. Thus, the graph satisfies untraceability for an 
adversary who knows action t of the user. It is trivial that a complete (k, v)-untraceable 
graph satisfies partial (k, v)-untraceability; all actions except trivial actions are 
connected to k potential actions in a complete (k, v)-untraceable graph. A graph that 
satisfies partial (k, v)-untraceability generally retains much more information than 
a complete (k, v)-untraceable graph when both are generated from the same user 
history graph. However, the (k, v)-untraceable graph may reveal some of the actions 
of users due to the relaxed definition of the privacy notion; an attack is considered 
successful only when an adversary obtains all the actions of a user. To trace all the 
actions of the user, the adversary has to select a sequence of actions from k sequences 
of actions; thus, the complete set of the user's actions is untraceable, even though 
some individual actions are traceable by the adversary. The parameter 
k means that an action (or a sequence of actions) is potentially associated with a 
user and k − 1 other users in the untraceable graph, and the parameter v means that v 
users perform the same action in the graph. Generally, we should select the parameter 
v = k with regard to the privacy requirement for a merged graph. The actions of a 
user are hidden in the actions of a group that consists of k members including the 
user. A privacy notion for the graph should be selected from the above two notions 
according to the use case of the graph and its privacy requirements. 


4.2.5.3 Algorithm Generating a Partial (k, v)-Untraceable History 
Graph 


The details of the algorithm are given as Algorithm 4.1, where $oe_t$ and $ie_t$ are 
defined as the number of outgoing edges and incoming edges of a node t, respectively. 
The algorithm for generating a partial (k, v)-untraceable history graph is as follows: 


1. This step consists of a part of the detailed algorithm, from line 1 to line 3. For 
the input of a user history graph G, the algorithm adds a virtual incoming edge 
($s_i \to r$) to each node $r \in start$ until the number of incoming edges is the same 
as the number of outgoing edges. Then, the algorithm adds a virtual outgoing edge 
($q \to u_j$) to each node $q \in end$ until the number of outgoing edges is the same 
as the number of incoming edges. A label of a virtual incoming edge $L(s_i \to x)$ 
denotes the number of users who first perform the action, and a label of a virtual 
outgoing edge $L(y \to u_j)$ denotes the number of users who perform the action 
at the end. 

2. This step consists of a part of the detailed algorithm, from line 4 to line 12. The 
algorithm searches for a node t that has fewer outgoing edges than k and for which 
all its lower nodes $N_{t \to end} \setminus t$ have fewer outgoing edges than k. Then, the algorithm 
removes all the outgoing edges ($t \to *$) that satisfy $L(t \to *) < v$. Next, the 
algorithm searches for a node t' that receives fewer than k incoming edges and for 
which all upper nodes $N_{start \to t'} \setminus t'$ receive fewer incoming edges than k. 
Then, the algorithm removes all the incoming edges ($* \to t'$) that satisfy 
$L(* \to t') < v$. The algorithm repeats this step until no node that meets the 
conditions is found. 

3. This step is the same as line 13, line 14 and line 15 in the detailed algorithm. 
The algorithm removes virtual incoming and outgoing edges, removes nodes that 
have no edges, and outputs the modified graph. 


Algorithm 4.1 Generation of a Partial (k, v)-Untraceable History Graph 

Input: User History Graph G, parameters k and v 
Output: Anonymized Graph $G^a(G, k, v)$ 

 1: $G^a(G, k, v) \leftarrow G$ 
 2: Add virtual incoming edges to start nodes 
 3: Add virtual outgoing edges to end nodes 
 4: $T \leftarrow$ all nodes $t_i$, where $oe_{t_{i \to end}} < k$ and all of its edges do not have $L(t_i \to *) > v$ 
 5: $T' \leftarrow$ all nodes $t'_j$, where $ie_{t_{start \to j}} < k$ and all of its edges do not have $L(* \to t'_j) > v$ 
 6: while $T \neq \emptyset$ or $T' \neq \emptyset$ do 
 7:     Choose $t_i$ from $T$ 
 8:     Remove all outgoing edges of $t_i$ where $L(t_i \to *) < v$ from $G^a(G, k, v)$ 
 9:     Choose $t'_j$ from $T'$ 
10:     Remove all incoming edges of $t'_j$ where $L(* \to t'_j) < v$ from $G^a(G, k, v)$ 
11:     Update $T$ and $T'$ 
12: end while 
13: Remove virtual edges 
14: Remove all nodes $t''$ where $oe_{t''} = 0$ and $ie_{t''} = 0$ from $G^a(G, k, v)$ 
15: return $G^a(G, k, v)$ 


4.2.5.4 Algorithm Generating a Complete (k, v)-Untraceable History 
Graph 


The details of the algorithm are denoted as Algorithm 4.2. The algorithm for gen- 
erating a complete (k, v)-untraceable history graph is as follows: 


1. The algorithm first executes Algorithm 4.1 except line 13 and line 15. 

2. This step consists of a part of the detailed algorithm, from line 3 to line 11. The 
algorithm searches for a node t that has fewer outgoing edges than k and removes 
all the outgoing edges ($t \to *$) that satisfy $L(t \to *) < v$, until no node is found. 
Then, the algorithm searches for a node t' that receives fewer incoming edges than 
k and removes all the edges ($* \to t'$) that satisfy $L(* \to t') < v$. The algorithm 
repeats this step until no node that meets the conditions is found. 

3. This step consists of line 12, line 13, and line 14 in the detailed algorithm. The 
algorithm removes virtual edges, removes nodes to which no edge is connected, 
and outputs the modified graph. 
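
The pruning loop in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is represented as a dictionary from (source, target) pairs to labels, the function name `prune_sparse_nodes` is hypothetical, and the virtual edges of step 1 and the cleanup of step 3 are omitted. Edges with a label of at least v are never removed, because the text removes only edges whose label is below v.

```python
from collections import Counter


def prune_sparse_nodes(edges, k, v):
    """Repeat step 2 until nothing changes.

    edges maps (source, target) -> label, where the label is the number of
    users who performed that transition. For every node with fewer than k
    outgoing (incoming) edges, all of its outgoing (incoming) edges whose
    label is below v are removed.
    """
    edges = dict(edges)  # work on a copy
    changed = True
    while changed:
        changed = False
        # Outgoing side: nodes with fewer than k outgoing edges.
        out_deg = Counter(src for src, _ in edges)
        for (src, dst), label in list(edges.items()):
            if out_deg[src] < k and label < v:
                del edges[(src, dst)]
                changed = True
        # Incoming side: nodes with fewer than k incoming edges.
        in_deg = Counter(dst for _, dst in edges)
        for (src, dst), label in list(edges.items()):
            if in_deg[dst] < k and label < v:
                del edges[(src, dst)]
                changed = True
    return edges


# Hypothetical toy graph: labels are user counts.
TOY_GRAPH = {
    ("a", "b"): 1,
    ("a", "c"): 5,
    ("b", "c"): 1,
    ("b", "d"): 1,
    ("c", "d"): 5,
}
PRUNED = prune_sparse_nodes(TOY_GRAPH, k=2, v=3)
```

In the toy graph, node b receives only one incoming edge with a label below v, so that edge is removed; the heavily used edges survive even around nodes with fewer than k edges.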


4.2.6 Other Notions 


Differential privacy [20, 21] is a notion of privacy for perturbative methods based on 
the statistical distance between two database tables differing by, at most, one element. 
The basic idea is that, regardless of background knowledge, an adversary with access 
to the dataset draws the same conclusions whether or not a person's data are included 
in the dataset. That is, a person's data have an insignificant effect on the processing 
of a query. Differential privacy is mainly studied in relation to perturbation methods 
[22–24] in an interactive setting. Attempts to apply differential privacy to search queries 
have been discussed in [25]. Li et al. proposed a matrix mechanism [26] applicable 
to predicate counting queries under a differential privacy setting. Computational 
relaxations of differential privacy were discussed in [27–29]. Another approach for 
quantifying privacy leakage is an information-theoretic definition proposed by 
Clarkson and Schneider [30]. They modeled an anonymizer as a program that receives two 
inputs: a user's query and a database response to the query. The program acts as a 
noisy communication channel and produces an anonymized response as the output. 
Hsu et al. provide a generalized notion [31] in decision theory for modeling 
the value of personal information. An alternative model for the quantification of 
personal information is proposed in [32]. In the model, the value of personal 
information is estimated by the expected cost that the user has to pay for obtaining perfect 
knowledge from given privacy information. Furthermore, the sensitivity of different 
attribute values is taken into account in the average benefit and cost models proposed 
by Chiang et al. [33]. Krause and Horvitz presented utility-privacy tradeoffs in online 
services [34, 35]. 


4.2.7 Combination of Anonymization Techniques 


A combination of anonymization methods leads to the construction of datasets that 
are useful and that preserve privacy. Some countries publish census data, and they 
combine several anonymization methods, such as generalization, noise addition, and 
sampling [36, 37]. However, some problems remain. One problem is that it is difficult 
to evaluate the privacy risks of anonymized datasets when anonymization methods are 
combined. Some research is available about the relationships among anonymization 
methods. Chaudhuri et al. proposed (c, ε, δ)-privacy [38] and studied the relationship 
between sampling and differential privacy [39]. Li et al. proposed (β, ε, δ)-differential 
privacy and studied the relationship among sampling, differential privacy, and 
k-anonymity. Soria-Comas et al. proposed a k-anonymized algorithm for differential 
privacy using an insensitive algorithm [40]. 


4.3 (p, N)-Identifiability 


4.3.1 Common Adversary Model 


Existing privacy measures are supposed to protect against idealized attackers, and it 
is difficult to maintain their utility and assess their reidentification risk. We designed 
adversary models to describe more realistic attackers by structuring a real setting 
for the attackers. In the case of exchanging anonymized datasets between compa- 
nies, for instance, a data-providing company first anonymizes and encrypts datasets 
for transmission to a receiver company via a secure channel. The receiver com- 
pany locates the dataset in a secure room and allows only authorized employees to 
access the anonymized dataset. This process can reduce the reidentification risk in 
the anonymized dataset, and it specifies the attacker and limits the ability to access 
datasets so that the attacker must know the quasi-identifiers of the neighbors or 
acquaintances. For example, it seems to be quite rare for an attacker to know all the 
quasi-identifiers of a target because the target is a neighbor of the attacker. Thus, a 
more stringent analysis of the reidentification risk can be achieved when we assume 
a more realistic situation, such as that the attacker has only limited knowledge of the 
victim. 

Access rights to an anonymized dataset may be given to attackers, and attackers 
may acquire some information about the original dataset or obtain the anonymization 
algorithm used to generate the anonymized dataset. Information about the original 
dataset is categorized into three parts as follows: information on a specified record 
such as a neighbor; the original dataset; and any other information except the target 
information that the attacker is seeking. The case of William Weld, who was governor 
of Massachusetts [41], is a typical example of reidentification, and an attack on the 
Netflix Prize dataset was carried out by a strong attacker who gained access to the 
Internet Movie Database [42]. 

We can consider the abilities of an attacker in two areas: knowledge about the 
dataset and the ability to simulate anonymization algorithms. Many previous studies 
such as [43, 44] assumed that an attacker has all the information required except 
knowledge of the target of the attack. In this paper, we consider an attacker who has 
knowledge of only the target record and can simulate anonymization algorithms to 
obtain anonymized records that may correspond to the target record. 


4.3.1.1 Definitions of Actual Attackers 


Generally, when an anonymized dataset is published on the Web, anyone who can 
access the dataset is a potential attacker; thus, the adversary model should be ideal 
because we cannot assume there is only a limited-knowledge adversary, and we have 
to assume all possible adversaries are present. On the other hand, when the dataset 
is managed under strict controls, the model adversary is not considered to be an 
unlimited-knowledge adversary. We design two realistic adversary models under the 
assumption that the dataset is managed in a restricted area (not public) and only a 
limited set of attackers can access the dataset; and then, we propose a privacy metric 
for privacy risk analysis. 


Definition 4.5 (Anonymization Simulator $f_{sim}$) Let $D_0$ with $n_0$ records, $D_1$ with $n_1$ 
records, $r^*[QI]$, and $r^*[SI]$ be an original dataset, an anonymized dataset generated 
from the original dataset, the quasi-identifiers of a record $r^* \in D_*$, and sensitive 
information from the record $r^* \in D_*$, respectively. An anonymization simulator $f_{sim}$ 
simulates an anonymization algorithm used to generate an anonymized dataset as an 
oracle and outputs $r^1_j[QI] \in D_1$ for the input $r^0_i[QI] \in D_0$. That is, 
$f_{sim} : r^0_i[QI] \to \{\mathbf{r}^1[QI], \bot\}$, where $\mathbf{r}^1[QI]$ is a set of $r^1_j[QI]$ and no output is 
produced in the case of $\bot$. 


The simulator is a deterministic process for deterministic anonymization, such 
as top-coding and bottom-coding, and a probabilistic process for probabilistic 
anonymization, such as random sampling. The simulator can be given access to $D_0$ 
to simulate the anonymization algorithm, even though no adversary can access $D_0$. 
Next, we define two adversary models. 


Definition 4.6 (Deanonymizer for Anonymized Datasets, DA) When 
$\exists r^0_i[QI] \in D_0$, $\forall r^1_j[QI \| SI] \in D_1$ and $f_{sim}$ are given, a deanonymizer DA lines up 
potential candidates $r^1_j$ corresponding to $r^0_i$ by executing the simulator $f_{sim}$; then, the 
deanonymizer DA outputs a list of candidates $r^1_j[QI \| SI]$ for $r^0_i$, where the number 
of records in the list is $n_q$, the number of sensitive information items in the list is $n_s$, 
and $0 \le n_s \le n_q \le n_0$. 


If an attacker knows the actual anonymization function f, the attacker can use f 
as $f_{sim}$, and the evaluation result should be more credible. 


Definition 4.7 (Reidentifying Adversary versus Anonymized Datasets) When 
$\exists r^0_i[QI] \in D_0$, $\forall r^1_j[QI \| SI] \in D_1$ and $f_{sim}$ are given, a reidentifying adversary 
executes the deanonymizer DA and can identify $r^1_j$, which is a record of the same 
person as the record $r^0_i$, from the records in a dataset $D_1$, where $r^0_i \in D_0$ is given. 
The success probability of the attack is calculated as $1/n_q$ when $r^1_j$ is included in the 
output by DA; otherwise, it is 0. 


Assuming an attacker who has $\exists r^0_i[QI] \in D_0$ is the same as assuming $|D_0|$ 
attackers who have $r^0_i \in D_0$ $(i = 1, \ldots, |D_0|)$. 


Definition 4.8 (Revealing Adversary versus Anonymized Datasets) When 
$\exists r^0_i[QI] \in D_0$, $\forall r^1_j[QI \| SI] \in D_1$ and $f_{sim}$ are given, a revealing adversary 
executes the deanonymizer DA and finds an $r^1_j[SI]$ from $\mathbf{r}^1[SI]$ such that $r^1_j$ is a record 
of the same person as the record $r^0_i$. The success probability of the attack is calculated 
as $1/n_s$ when $r^1_j$ is included in the output of DA; otherwise, it is zero. 
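
The two adversaries differ only in what they count in the candidate list. The following minimal sketch, assuming a hypothetical deterministic simulator `identity_sim` and a toy anonymized table (neither is from the original study), shows how $n_q$, $n_s$ and the two success probabilities could be derived from the output of a deanonymizer. The probabilities apply only when the correct record is in the list.

```python
from fractions import Fraction


def identity_sim(target_qi):
    """Hypothetical deterministic f_sim: the anonymized quasi-identifiers
    are assumed to equal the target's (already generalized) ones."""
    return {target_qi}


def deanonymize(target_qi, anonymized, f_sim):
    """Deanonymizer DA: list the anonymized records that f_sim can produce."""
    possible = f_sim(target_qi)
    return [rec for rec in anonymized if rec["qi"] in possible]


def success_probabilities(candidates):
    """Return (n_q, n_s, 1/n_q, 1/n_s) for a candidate list."""
    n_q = len(candidates)                          # number of candidate records
    n_s = len({rec["si"] for rec in candidates})   # distinct sensitive values
    reidentify = Fraction(1, n_q) if n_q else Fraction(0)
    reveal = Fraction(1, n_s) if n_s else Fraction(0)
    return n_q, n_s, reidentify, reveal


# Hypothetical anonymized table: quasi-identifier classes plus a sensitive value.
ANONYMIZED = [
    {"qi": ("-30", "175-"), "si": "Hospital"},
    {"qi": ("-30", "175-"), "si": "Shop"},
    {"qi": ("-30", "175-"), "si": "Shop"},
    {"qi": ("31-", "-174"), "si": "Office"},
]
CANDIDATES = deanonymize(("-30", "175-"), ANONYMIZED, identity_sim)
RESULT = success_probabilities(CANDIDATES)
```

With three candidates carrying two distinct sensitive values, the reidentifying adversary succeeds with probability 1/3 and the revealing adversary with probability 1/2.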


A revealing adversary does not try to identify the record but tries to access sensitive 
information. In other words, the attacker seeks only to obtain sensitive information 
from the record in question. More precisely, the success probability of the revealing 
adversary can be calculated as $[n_s]/n_q$, where the correct number of sensitive items 
in the list is $[n_s]$, but the probability itself may be uncertain. Assume that when 
the probability is 0.99, some attackers are convinced that the target should be the 
majority. Furthermore, in the case that the deanonymizer DA is leaked and the $f_{sim}$ 
used in the deanonymizer is a deterministic process, an attacker can infer the sensitive 
information of $r^0_i$. On the other hand, when the $f_{sim}$ used in the deanonymizer is a 
probabilistic process, even if DA is leaked, outputting the result should not involve 
uncertainty. 


4.3.1.2 (p, N)-Identifiability 


Here, we assume that anonymized datasets are strictly controlled and that the attacker 
has knowledge of a specific record and the anonymization algorithms. We assume 
that the attacker is the strongest type of attacker and has knowledge of the most 
characteristic record. Nevertheless, it is difficult to quantify this characteristic, so we 
assume that each attacker has an original record. In other words, we assume there 
are as many attackers as there are original records. 


Definition 4.9 ((p, N)-identifiability) Let p be the success probability for an 
adversary who has $\exists r^0_i[QI] \in D_0$, $\forall r^1_j[QI \| SI] \in D_1$ and $f_{sim}$, and N be the number of 
adversaries whose attack success probability is p. 


The probability p is the conditional probability that the adversary can select the 
correct record from the list produced by the deanonymizer DA when the correct 
record is included in the list. The probability that the deanonymizer successfully 
produces the list, including the correct record, depends on the anonymization 
algorithms. 

Our model can extend to an adversary who has knowledge of two or more records. 
For simplicity, we use an adversary model that knows a single record and consider N 
single-knowledge adversaries in our risk analysis. The idea of (p, N)-identifiability 
is studied in [2]. 


4.3.2 Success Probability Analysis Based on the Common 
Adversary Model 


In this section, we assume the attackers described in the previous section and 
explain the calculation to obtain the success probability of attacks on representative 
anonymization methods: generalization, noise addition, and sampling. We consider 
that $f_{sim}$ is constructed as a typical combined algorithm selected from three 
anonymization algorithms, $f_{generalization}$, $f_{sampling}$, and $f_{noise}$. We explain the above 
three anonymization algorithms and show combined anonymization using an example 
dataset. 


4.3.2.1 Generalization 


We include deletion of records or cells and top- or bottom-coding as steps in 
generalization. One step of $f_{generalization}$ is similar to k-anonymity in checking the 
number of identical combinations of quasi-identifiers. When an anonymized dataset 
has k-anonymity, p equals 1/k. k-anonymity is an intuitive privacy metric, but the 
greater the number of attributes, the more difficult it is for the datasets to achieve 
k-anonymity. If an attacker has generalization trees for each attribute, the attacker 
adds records that satisfy the requirements of the trees to the list of candidates. 
When there is a record whose address attribute is Tokyo, for instance, an attacker 
who has the generalization tree adds records whose addresses are in the Kanto region 
as well as records whose addresses are in Eastern Japan to the list of candidates. It is 
reasonable to assume that an attacker can infer the generalization tree, and in our 
experiment, $f_{sim}$ is considered capable of accessing the generalization trees of each attribute. 
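
The Tokyo example can be sketched as a walk up a generalization tree. The tree below contains only the three levels named in the text; the dictionary layout, the function names, and the extra value "Osaka" are assumptions made for illustration.

```python
# Hypothetical generalization tree for the address attribute (child -> parent).
ADDRESS_TREE = {
    "Tokyo": "Kanto region",
    "Kanto region": "Eastern Japan",
}


def generalizations(value, tree):
    """Return the value followed by all of its ancestors in the tree."""
    chain = [value]
    while chain[-1] in tree:
        chain.append(tree[chain[-1]])
    return chain


def candidates_by_tree(target_value, anonymized, tree):
    """List the anonymized records whose value is the target's value or
    one of its generalizations."""
    allowed = set(generalizations(target_value, tree))
    return [rid for rid, value in anonymized.items() if value in allowed]


# Toy anonymized column; record ids and the value "Osaka" are illustrative.
ANONYMIZED_ADDRESSES = {
    "a": "Tokyo",
    "b": "Kanto region",
    "c": "Eastern Japan",
    "d": "Osaka",
}
TOKYO_CANDIDATES = candidates_by_tree("Tokyo", ANONYMIZED_ADDRESSES, ADDRESS_TREE)
```

An attacker whose target lives in Tokyo therefore keeps records a, b, and c as candidates and discards d.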


4.3.2.2 Random Sampling 


When an attacker who has one original record is assumed, the privacy risk differs 
greatly among the original datasets. Consider an original dataset with many unique 
records, and assume that random sampling is implemented. Let M be the number of 
unique records and α be the sampling rate. The probability that no unique record 
appears in the sample is $(1 - \alpha)^M$. Even when α = 0.1 and M = 44, the probability is less than 
1%. When a large dataset is anonymized, it is possible that there will be more than 
44 unique records, which shows that if sampling is implemented, a characteristic 
record may be identified or suspected. 
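
The figure for α = 0.1 and M = 44 can be checked directly. The sketch below assumes that each record is sampled independently with probability α, which is the assumption under which $(1 - \alpha)^M$ is the probability that none of the M unique records is sampled; the function names are illustrative.

```python
def prob_no_unique_sampled(alpha, m):
    """Probability that none of m unique records is sampled at rate alpha,
    assuming independent sampling of each record."""
    return (1.0 - alpha) ** m


def smallest_m_below(alpha, threshold):
    """Smallest number of unique records for which the probability of
    sampling none of them drops below the threshold."""
    m = 0
    while prob_no_unique_sampled(alpha, m) >= threshold:
        m += 1
    return m


P_44 = prob_no_unique_sampled(0.1, 44)      # roughly 0.0097
M_ONE_PERCENT = smallest_m_below(0.1, 0.01)
```

At a 10% sampling rate, 44 is the smallest number of unique records for which the chance of sampling none of them falls below 1%, so at least one unique record is then sampled with probability above 99%.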

We evaluate sampling as follows: For simplicity, we consider the case where the 
anonymization method is only random sampling. When a unique record is sampled, 
an attacker who knows the person is certain that the record is for that person. Thus, 
the probability p does not change. On the other hand, sampling reduces the number 
of unique records, and N decreases accordingly. When unique records are very few 
and do not appear in an anonymized dataset, p decreases. We apply this approach to 
the case of combining different anonymization methods. 

The approaches to sampling vary, and we can also consider $f_{sampling}$ in various 
ways. For instance, the probability of disclosing the identity of any individual is 
evaluated by using the posterior probability of population uniqueness [45]. 


4.3.2.3 Noise Addition 


There are two cases of noise addition: One is adding noise to the numerical data itself, 
and the other is adding noise to its quantity. In the former case, the data consist of 
original numerical data or data anonymized by a process, such as microaggregation, 
and in the latter case, the data are original quantity data or anonymized data, such as 
11-20 in the age attribute. 

In the former case, we can consider $f_{noise}$ as follows. Noise is added based on 
a probability distribution, such as normal, Laplace, and exponential distributions. 
In particular, it has been mathematically proven that adding Laplace noise to the 
output of some queries achieves differential privacy [39], so this type of noise is 
widely used. Therefore, when an anonymized record is included in the 90 or 95% 
confidence interval, the record is added to the list of candidates. More simply, when 
original data and anonymized data have small differences such as 10 or 20% for each 
attribute, the attacker may consider the possibility that they are the same. 
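
As a worked illustration of the interval test for Laplace noise, the sketch below uses the fact that for zero-mean Laplace noise X with scale b, P(|X| ≤ t) = 1 − exp(−t/b), so the interval containing a fraction c of the noise has half-width −b ln(1 − c). The function names and the sample values are assumptions; the scale would have to be known or estimated by the attacker.

```python
import math


def laplace_half_width(scale, confidence):
    """Half-width t such that P(|X| <= t) = confidence for X ~ Laplace(0, scale)."""
    return -scale * math.log(1.0 - confidence)


def is_noise_candidate(original, anonymized, scale, confidence=0.95):
    """True when the anonymized value lies inside the interval around the
    original value that contains the given share of Laplace noise."""
    return abs(anonymized - original) <= laplace_half_width(scale, confidence)


HALF_WIDTH_95 = laplace_half_width(1.0, 0.95)   # equals ln(20), about 3.0
```

For scale 1, a value shifted by 2 stays within the 95% interval, whereas a value shifted by 3 falls just outside it.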

In the latter case, we cannot use the same method. When a record has 72 and is 
anonymized to 95, for instance, the attacker whose target is a specific person may 
not regard the target to be that person. However, the attacker can link them after 
top-coding is executed and the value is changed to 70-. On the other hand, when a record 
is 19, is anonymized to 20 and is generalized to 20-29, the attacker may not link 
them. One idea for $f_{noise}$ is that the group of each attribute can be changed to the 
next group and such records are output as candidates. As in the generalization step, 
an attacker can infer the next group for each group, and $f_{noise}$ can be thought of as 
defining the distance of each classification. 

The description above shows that when the order of anonymization is changed, 
$f_{sim}$ will also be changed. 


4.3.2.4 Combination of Anonymization Methods 


The principles of each anonymization can be combined by evaluating each 
anonymization step by step. Stated differently, an attacker has $f_{generalization}$, $f_{sampling}$, 
and $f_{noise}$ as $f_{sim}$. We show examples of combined cases by using a sample 
dataset (Fig. 4.1). An attacker should change his or her approach when the order 
of anonymization is changed if he or she knows this fact. We assume five attacker 
models, A1 to A5, in the following example, and the candidates of each attacker 
model are represented as C1 to C5. We denote Ci of rj in the following figures as the 
candidates of an attacker Ai who has rj as a target. The adversary model for A1 to 
A4 is the reidentifying adversary defined in Definition 4.7, and the adversary model 
in Fig. 4.4 is the revealing adversary defined in Definition 4.8. 

Fig. 4.1 Sample dataset 

      ATTR1   ATTR2   ATTR3 
r1            178     Hospital 
r2    31      179     Office 
r3    38      165     Office 
r4    30      180     Shop 
r5    27      167     Hospital 
r6    29      171     Shop 
r7    33      173     Hospital 

Let the conditions of attackers be as follows: A1 and A3 do not consider noise 
addition and generalization but simply compare $r^1 \in D_1$ with $r^0 \in D_0$. This is one 
approach to $f_{noise}$ and $f_{generalization}$. On the other hand, A2, A4, and A5 do consider 
the added noise and generalization. We define the noise addition shown in Fig. 4.2 
as follows: the classifications of each attribute change to the next classification with 
a certain probability. We assume A2 knows the rule of noise addition and that $f_{noise}$ 
of A2 outputs candidates that have a different classification in one attribute from an 
original record. On the other hand, let a small amount of noise be added in step (a) 
of Figs. 4.3 and 4.4. We assume the attackers A4 and A5 know the rule and that $f_{noise}$ 
of A4 and A5 outputs candidates whose values of ATTR1 are different but within 2 
from the original record and whose values of ATTR2 are different but within 4 from 
the original record. In the figures, the boldface sections show that the classifications 
are not correct but are within the permissible range for $f_{noise}$ of A2, A4, and A5. 
The red boldface sections show that there are substantial distances from the original 
values and that attackers who have the record cannot link them. 


4.3.2.5 Examples of Analyses 


The Case of A1 

Fig. 4.2 Sample anonymization and the result of simulation attack 1 

Fig. 4.3 Sample anonymization and the result of simulation attack 2 

Fig. 4.4 Sample anonymization and the result of simulation attack 3 


Generalization, noise addition, and sampling are executed as anonymizing 
methods in Fig. 4.2. In the generalization step (a), all records are generalized to 
be divisible into equal parts. As a result, only r2 is unique, and this dataset has 
(1, 1)-identifiability. 

In step (b), r1, r4, and r6 are changed by the addition of noise. As a result, r1 and r2 
are indistinguishable. r3, r4, and r7 are also indistinguishable, but r5 and r6 become 
unique. We define A1 as not considering the addition of noise, so that an attacker 
who has r6 cannot link the original record but an attacker who has r5 can. Therefore, 
identifiability becomes (1, 1)-identifiability. 

After sampling, in step (c), r2, r4, and r5 do not appear. Then, r3 and r7 become the 
focus, and identifiability becomes (1/2, 2)-identifiability. This attacker 
simply checks how many of the same records there are in the dataset. Even if various 
anonymization methods are implemented, some records may not be affected. 
Therefore, it is important to assume such attackers. When we can say that a dataset has a 
certain level of privacy from such attackers, it means that an attacker cannot link the 
target with the original record by accident. 


The Case of A2 

We omit the explanation of step (a) because noise is not added. In step (b), the attacker 
with r1, for example, chooses r1, r2, r5, and r6 as candidates because one or more of 
their attributes match r1 = {-30, 175-}. On the other hand, an attacker with r4 cannot 
output candidates because both attributes of r4 are changed. Hence, identifiability is 
(1/4, 2)-identifiability. In step (c), r5 does not appear, and identifiability becomes 
(1/4, 1)-identifiability. 


The Case of A3 

In Fig. 4.3, the dataset is anonymized by the addition of noise, generalization, and 
sampling. 

In the case of A3, the dataset with added noise is safe enough from attackers 
who do not consider the added noise and we omit this case; however, this does 
not mean that noise addition is safe, and when another attacker, such as A4, is 
considered, the result should be different. In step (b), we focus on the attacker with 
r3. This is the strongest attacker, and this attacker suspects that r2 and r3 are the 
candidates. More specifically, the scope is r3 = {38, 165} = {31-, -174} and r2, r3 
meet the requirement. The attacker with r2 seems to have the same risk but cannot 
identify that the actual target r2 is a possible candidate because the noise of ATTR3 is 
great enough. Hence, the identifiability becomes (1/2, 1)-identifiability. In step (c), 
r3 does not appear, and the privacy risk is (1/3, 1)-identifiability. 


The Case of A4 

Next, we show the case of A4. In step (a), every record but r1 and r7 has enough added 
noise, and attackers cannot infer which is the correct record. The attacker with r7 
regards the records within {33 ± 2, 173 ± 4} as candidates. Only r7 satisfies the 
condition, and the privacy risk is (1, 1)-identifiability. In step (b), the effect of noise 
addition becomes weak, and the number of attackers who should be considered increases. 
The attacker with r6, for instance, regards the records within {29 ± 2, 171 ± 4} = 
{(-30, 31-), (-174, 175-)}, namely, all records, as candidates. The privacy risk becomes 
(1/2, 1)-identifiability after generalization is finished. In step (c), similar to the 
previous steps, the privacy risk becomes (1/3, 1)-identifiability. 
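
The reason the attacker with r6 must accept all records after generalization can be sketched by mapping the tolerance interval onto the generalized classes. The class boundaries (30 for ATTR1 and 174 for ATTR2) are read off the class labels used in the text; the function name and the integer-valued assumption are illustrative.

```python
def reachable_classes(value, tolerance, boundary):
    """Classes of a two-way split ("-boundary" and "(boundary+1)-") that the
    interval value +/- tolerance overlaps, assuming integer values."""
    classes = set()
    if value - tolerance <= boundary:
        classes.add(f"-{boundary}")
    if value + tolerance > boundary:
        classes.add(f"{boundary + 1}-")
    return classes


# Attacker with r6 = {29, 171}: tolerance 2 on ATTR1 and 4 on ATTR2.
R6_ATTR1 = reachable_classes(29, 2, 30)
R6_ATTR2 = reachable_classes(171, 4, 174)

# Attacker with r7 = {33, 173} for comparison.
R7_ATTR1 = reachable_classes(33, 2, 30)
R7_ATTR2 = reachable_classes(173, 4, 174)
```

For r6, both classes of both attributes are reachable, so every generalized record is a candidate; for r7, only the class 31- is reachable in ATTR1, which narrows the candidate list.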


The Case of A5 

Finally, we show an example of a revealing adversary. 

An attacker can claim to succeed when the sensitive information ATTR3 of the 
target can be correctly identified. Step (a) is similar to that of the case of A4. In 
step (b), the attacker with r3 suspects r2 and r3 are the candidates. Their ATTR3 values 
are, however, both "Office," and the attacker claims to identify the person. Thus, the privacy 
risk is (2/2 = 1, 1)-identifiability, which is similar to l-diversity. In step (c), the attacker 
with r1 suspects r1, r4, and r6 are the candidates; the ATTR3 of r1 is "Hospital," and 
that of the others is "Shop." Therefore, the probability of reidentification is 1/2. More 
precisely, the probability is 1/3 because there are three candidates and one is correct, 
but the probability may be important information for the attacker with r1. The same 
can be said of the attacker with r7; therefore, the risk according to our definition is 
(1/2, 2)-identifiability. 

As described above, when the adversary model is different, the result of the risk 
is also different. Assuming attackers who disregard noise, we consider the risk to the 
records whose fluctuations are due to anonymization to be small. On the other hand, 
assuming attackers who do consider the actual added noise, we consider the risk to 
the dataset as a whole. Moreover, strong attackers can be assumed to use the inverse 
function of the actual noise or anonymization method. In the case that noise based on 
a normal distribution is added, for instance, an optimal distance-based record linkage 
can be performed [46]. 

It is important to consider the various types of attackers in this way, because the 
most important factor of privacy is the inability to definitely link an anonymized 
record X' and original record X. By considering many attackers, our metrics ensure 
that the attackers considered can neither identify a record nor make an identification 
by chance. 


4.3.2.6 Implementation of the Analysis Algorithm 


Processing time is a problem when our metric is applied to a large dataset. In this 
section, we discuss this problem. 

First, we have to evaluate the risk from attackers with each record, and when 
sampling is implemented, the candidates in each record need to be preserved across 
the sampling. However, we do not need to store the candidates for every record or 
the records that have certain risks because the metric does not consider attackers 
who have knowledge of a record that does not have the highest risk. Moreover, 
when anonymization and evaluation are performed repeatedly, it takes a long time to 
evaluate the risk because the same number of attackers as the number of records are 
assumed. Thus, a threshold risk can be introduced to resolve the problem. When the 
risk of an attack does not exceed the threshold, attackers do not need to be evaluated. 
It is possible, however, that the risk may increase depending on the situation (see 
r5 and r6 in Fig. 4.2). Therefore, when a threshold is introduced, the accuracy of the 
privacy risk may worsen. We describe the pseudocode of risk analysis as follows: 


Algorithm 4.5 ($D_0$, $D_1$, $A$, $f_{sim}$): Risk analysis. 

Input: Original dataset $D_0$, anonymized dataset $D_1$, adversary model $A$, and attack 
simulator $f_{sim}$ 

1: while $\forall r_i^0 \in D_0$ do 
2:   $p_i \leftarrow$ simulation attack($r_i^0$, $D_1$, $A$, $f_{sim}$) 
3: end while 
4: $p \leftarrow \max(p_i)$ 
5: $N \leftarrow$ count($\max(p_i)$) 
6: return $p$, $N$ 
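
The loop of Algorithm 4.5 can be illustrated with a minimal Python sketch. The attack simulator used here is an assumption made only for illustration: `within_tolerance` treats an anonymized record as a candidate when every attribute lies within a relative tolerance of the value the attacker knows, and the risk of a record is taken to be one over the number of candidates. The function names, the `None` convention for records removed by sampling, and the sample values are all hypothetical.

```python
from fractions import Fraction


def within_tolerance(known, candidate, tol=0.05):
    """Hypothetical f_sim: every attribute of the candidate lies within
    a relative tolerance of the value the attacker knows."""
    return all(abs(c - k) <= tol * abs(k) for k, c in zip(known, candidate))


def simulate_attack(i, original, anonymized, matches):
    """Risk p_i for the attacker who knows original[i].

    Assumes anonymized[i] is the anonymized form of original[i], or None
    when the record was removed by sampling.
    """
    candidates = [j for j, rec in enumerate(anonymized)
                  if rec is not None and matches(original[i], rec)]
    if i not in candidates:
        return Fraction(0)  # the true record is not among the candidates
    return Fraction(1, len(candidates))


def risk_analysis(original, anonymized, matches=within_tolerance):
    """Return (p, N): the highest risk and the number of records reaching it."""
    risks = [simulate_attack(i, original, anonymized, matches)
             for i in range(len(original))]
    p = max(risks)
    return p, risks.count(p)


# Made-up (total cholesterol, HbA1c) pairs; the third record is an outlier.
original = [(180, 5.6), (182, 5.7), (250, 9.0)]
anonymized = [(181, 5.6), (183, 5.65), (252, 9.1)]
p, n = risk_analysis(original, anonymized)
print(p, n)
```

In this toy dataset the first two records hide each other, while the outlier has a single candidate, so it alone determines the pair (p, N).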


Second, the attackers do not have to compare their records with every record 
because the method of evaluation is similar to that of k-anonymity, and the attackers 
only need to compare their records with a representative of each group. In (b) of 
Fig. 4.3, for instance, the attackers need to compare their records only with 
{-30, 175-}, {31-, -174}, and {31-, 175-}. However, when the levels of generalization 
differ among records, this method cannot be applied, and every record should be 
checked. To speed up this case, we first count the number of distinct values of each 
attribute and then compare each attribute of $r_i^0$ with that of each record of $D_1$, 
starting with the attributes that have the most distinct values. 

Finally, when the procedure for anonymization is known in advance, it is possible 
to perform the evaluation more quickly by considering the effect of the initial part of 
the anonymization. For instance, in Fig. 4.3a, we only have to consider cells whose 
values do not exceed 30 in $ATTR_1$ or fall short of 174 in $ATTR_2$. 


4.3.3 Experiment 


4.3.3.1 Experimental Environments 


We conducted experiments to evaluate the validity of the proposed metrics. We 
measured the time to output the risk and confirmed that the privacy metric was 
appropriate. We used three parameters, $k$, $\beta$, and $\epsilon$, for comparison and verified 
the relationships among k-anonymity, sampling, and noise addition. We implemented 
our risk analysis method on a PC with an Intel Core i7-4790 3.6-GHz CPU and 16.0 GB 
of memory. 


4.3.3.2 Dataset and Adversary Model 


We used a pseudomedical dataset based on an actual medical dataset. The dataset 
had 10,000 records and two attributes, total cholesterol (TC) and HbA1c, and the 
distribution of each attribute is shown in Figs. 4.5 and 4.6. We first measured the 
computation time while changing the number of records and then evaluated the validity 
of our metrics while changing the parameters of each anonymization method. Noise 
addition, generalization, and sampling were used as representative anonymization 
methods, and we adopted the Mondrian algorithm [9] for k-anonymization, Laplace 
noise for noise addition, and random sampling for sampling. We assumed 
reidentifying adversaries $A_1$ to $A_4$. The conditions of the attacker models are the same as 
those of Sect. 4.3.2.4 except for noise addition. We define $f_{noise}$ of $A_2$ and 
$A_4$ so that it outputs, as candidates, the records whose value for each attribute 
differs from the original value by no more than 5%. 


4.3.4 Results 


4.3.4.1 Computational Complexity 


Our proposed privacy metrics are intended to be applicable to large datasets. 
We measured the execution time while changing the number of records (Table 4.1) and 
the parameters (Tables 4.2, 4.3, and 4.4). 

It takes little time to evaluate the risk when simple attackers, such as $A_1$ and 
$A_3$, are considered. On the other hand, when reflective attackers are assumed, the 
number of calculations increases and more time is required for evaluation. However, 
some of the processing described above reduces the time. For instance, although the 
number of combinations of attribute values increases with the number of records, once 
an attacker has checked the risk of a record, that attacker does not have to calculate 
the risk of other records that have the same values. Therefore, the analysis algorithm 
is appropriate for large datasets. 
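
The reuse of results for records with identical values can be sketched as a simple cache. This is only an illustration under assumptions: `risk_of` stands for whatever attack simulation is in use, and the toy risk function and the records below are hypothetical.

```python
from collections import Counter


def risks_with_cache(records, risk_of):
    """Evaluate risk_of once per distinct record and reuse the result."""
    cache = {}
    risks = []
    for rec in records:
        key = tuple(rec)
        if key not in cache:
            cache[key] = risk_of(key)  # the expensive attack simulation
        risks.append(cache[key])
    return risks, len(cache)


records = [(31, 175), (31, 175), (30, 174), (31, 175)]
counts = Counter(records)
# Hypothetical risk: one over the number of identical records.
risks, evaluations = risks_with_cache(records, lambda r: 1 / counts[r])
print(risks, evaluations)
```

Four records are scored here with only two evaluations, because three of them share the same values.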


Table 4.1 Execution time 


# of records   $A_1$ (ms)   $A_2$ (ms)   $A_3$ (ms)   $A_4$ (ms) 
1000           1.8          699.6        131.8        569.0 
5000           2.6          17,005.6     751.2        8,920.8 
10000          4.7          32,764.2     1,361.6      12,925.5 


Table 4.2 The case of $\epsilon$ = 0.5, k = 2 


$\beta$   $A_1$ (ms)   $A_2$ (ms)   $A_3$ (ms)   $A_4$ (ms) 
0.05      2.6          17,005.6     751.2        8,920.8 
0.10      1.2          18,950.8     512.8        5,084.8 
0.30      2.0          26,715.4     139.2        8,285.4 


Table 4.3 The case of $\beta$ = 0.05, k = 2 


$\epsilon$   $A_1$ (ms)   $A_2$ (ms)   $A_3$ (ms)   $A_4$ (ms) 
0.5          2.6          17,005.6     751.2        8,920.8 
1.0          1.4          17,002.4     628.6        9,256.4 
3.0          1.6          16,894.8     945.0        8,968.2 


Table 4.4 The case of $\beta$ = 0.05, $\epsilon$ = 0.5 


k   $A_1$ (ms)   $A_2$ (ms)   $A_3$ (ms)   $A_4$ (ms) 
2   2.6          17,005.6     751.2        8,920.8 
3   2.9          16,828.6     744.2        8,788.4 
4   2.8          17,211.9     755.8        9,013.1 


Table 4.5 Relationship among parameters and our metrics (p, N), k = 2 


                   $\beta$ = 0.05   $\beta$ = 0.1   $\beta$ = 0.3 
$\epsilon$ = 0.1   (0.0196, 1)      (0.0303, 2)     (0.0909, 1) 
$\epsilon$ = 0.5   (0.0204, 1)      (0.0250, 1)     (0.1000, 1) 
$\epsilon$ = 1.0   (0.0208, 1)      (0.0278, 1)     (0.1000, 1) 


When the sampling rate is changed, the computation time differs depending on 
the attacker. This is because there are two loop processes, one for sampled records 
and one for nonsampled records, and the calculation method of each process differs 
depending on the attacker. 

Noise addition has little effect on the computation time in this experiment, 
but when a very large amount of noise is added, the distribution of the records 
becomes uniform and the number of distinct records increases; as a result, the 
computation time may increase. 

The effect of k-anonymity also seems minimal, but when $k$ is large, the number 
of distinct records decreases and the computation time may decrease. 


4.3.4.2 Validation 


We observed $p$ and $N$ by changing the sampling rate $\beta$ and the noise parameter $\epsilon$ to 
verify the validity of our metrics. We evaluated the attacker model A, while changing 
the parameters $k$, $\beta$, and $\epsilon$. The evaluation results are shown in Tables 4.5 and 4.6. 
The risk to privacy decreases as $k$ increases and as $\beta$ and $\epsilon$ decrease, so the 
risk behaves as a valid privacy metric. The sampling rate is the key factor that reduces 
the risk in this experiment. There are some outliers in the datasets, and they are the 
cause of the risk. In fact, if such records are not sampled, the privacy risk decreases. 
We conducted this experiment multiple times, and the result was different each time. 
Table 4.7 presents a sample of the evaluation results. Some outliers were included in 
the third operation, and the risk was higher than that of the other operations. 
Therefore, the key factor may change when outliers are removed in advance. 


Table 4.6 Relationship among parameters and our metrics (p, N), k = 4 


             $\beta$ = 0.05   $\beta$ = 0.1   $\beta$ = 0.3 
$\epsilon$   (0.0154, 1)      (0.0270, 2)     (0.0667, 2) 
             (0.0192, 1)      (0.0227, 2)     (0.0625, 3) 
             (0.0200, 1)      (0.0238, 2)     (0.0625, 1) 


Table 4.7 Case of $\beta$ = 0.05, $\epsilon$ = 1.0 


Times   $A_1$         $A_2$         $A_3$         $A_4$ 
1       (1.0000, 3)   (0.0035, 1)   (0.0083, 1)   (0.0049, 1) 
2       (1.0000, 2)   (0.0013, 4)   (0.0108, 1)   (0.0035, 1) 
3       (1.0000, 4)   (0.0217, 1)   (0.1667, 1)   (0.0204, 1) 
4       (0.5000, 5)   (0.0030, 1)   (0.0667, 1)   (0.0050, 1) 
5       (1.0000, 5)   (0.0032, 1)   (0.0294, 1)   (0.0051, 1) 


4.4 Extension to Time-Sequence Data 


4.4.1 Privacy Definition 


We define two types of attack models for time-sequence datasets. The first, a reiden- 
tification attack, is a general attack model where an attacker has information on the 
original dataset M and tries to reidentify it in an anonymized dataset A(M). This 
model assumes that an attacker has maximal information about the original dataset. 
This model is the same as that of k-anonymization, where even if an attacker has an 
original dataset, the probability of the reidentification of a k-anonymized dataset is 
1/k. 


Definition 4.10 (Reidentification attack) Let an attacker have a matrix 
$M_{t_1} \in \mathbb{R}^{n \times m}$ and an anonymized matrix $A(M_{t_1}) \in \mathbb{R}^{n \times m}$. A 
reidentification attack against a record $r_i$ succeeds if record $r_i \in M_{t_1}$ is linked 
to record $r_j \in A(M_{t_1})$, where $r_i$ and $r_j$ are the same user. 


A linkage attack, which is an attack by a valid user, is one in which an attacker 
tries to obtain information from the given datasets $A(M_{t_1})$ and $A(M_{t_2})$. $A(M_{t_1})$ and 
$A(M_{t_2})$ are assumed to include the same users, but the primary keys are different. 
An attacker in this model has only anonymized datasets, so a valid user is assumed 
to be an attacker in this model. There are few studies concerning this problem, and 
we evaluate the risk using actual datasets in this paper. 

Fig. 4.7 Example of a risk 


Definition 4.11 (Linkage attack) Let an attacker have two anonymized matrices, 
$A(M_{t_1}) \in \mathbb{R}^{n \times m}$ and $A(M_{t_2}) \in \mathbb{R}^{n \times m}$. $M_{t_1}$ and $M_{t_2}$ include the 
same users and items, where each user and item of $M_{t_1}$ are the same as those of 
$M_{t_2}$. A linkage attack against a record $r_i$ succeeds if record $r_i \in A(M_{t_1})$ is 
linked to record $r_j \in A(M_{t_2})$, where $r_i$ and $r_j$ are the same user. 


We next define the privacy metric as follows: 


Definition 4.12 (Privacy metric) Let $n$ be the total number of users of a dataset $M$ 
and $n'$ be the number of users that are successfully attacked. The privacy risk of $M$ 
is defined as $n'/n$. 


We regard both attacks as solving an assignment problem. An assignment problem 
is to find an appropriate assignment of tasks when there are n users and n tasks, 
and the Hungarian algorithm [47] solves the assignment problem in such a way that 
the total cost is minimal. 
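
To make the assignment view concrete, the sketch below links the records of two small binary access matrices by minimizing the total Hamming distance. It is a stand-in under stated assumptions: the exhaustive search over permutations replaces the Hungarian algorithm and is practical only for a handful of users, the Hamming distance is an assumed cost, and the matrices and the true mapping `truth` are made up.

```python
from itertools import permutations


def hamming(a, b):
    """Number of positions in which two binary records differ."""
    return sum(x != y for x, y in zip(a, b))


def min_cost_assignment(known, target, distance=hamming):
    """Brute-force minimum-cost assignment (stand-in for the Hungarian
    algorithm); assignment[i] is the target row linked to known row i."""
    n = len(known)
    cost = [[distance(k, t) for t in target] for k in known]
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return list(best)


def success_rate(assignment, truth):
    """Fraction of records linked to the correct user."""
    return sum(a == t for a, t in zip(assignment, truth)) / len(truth)


known = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 1]]
# The same four users in a second period, with shuffled pseudonyms.
target = [[0, 0, 0, 1, 1], [1, 1, 0, 0, 0], [1, 0, 0, 0, 1], [0, 0, 1, 0, 0]]
truth = [1, 3, 0, 2]  # known row i corresponds to target row truth[i]

assignment = min_cost_assignment(known, target)
print(assignment, success_rate(assignment, truth))
```

For realistic sizes, such as the 200 users used later in this chapter, the exhaustive search would have to be replaced by a polynomial-time solver.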

We apply the same algorithm to both reidentification and linkage attacks and 
assume that the attack succeeds when an attacker assigns a record to the correct user. 
When a dataset is k-anonymized, each record has at least k − 1 identical records. 
Hence, when a record is assigned to the cluster to which the correct record belongs, 
we regard the record as being assigned correctly even if the assigned record is not 
actually the correct one. In this case, the ratio of correct assignments is the ratio of 
correctly assigned clusters, so we define the privacy metric as that ratio multiplied 
by 1/k (Fig. 4.7). 

Figure 4.7 shows an example of a risk evaluation. The dataset on the left is the 
original dataset and that on the right is the anonymized dataset. The arrows indicate 
the assignment result. User 2 of the original dataset, for instance, is assigned to user 
3 of the anonymized dataset, so the attack on user 2 fails. When noise addition is 
used as the anonymization method, users 2, 3, 4, and 5 are assigned to the wrong 
users and the privacy risk is 3/7. On the other hand, when k-anonymization is used, 
in this case with $k = 2$, users 4 and 5 are assigned to the wrong users (blue arrows) but 
are assigned to the same clusters as the correct users. Therefore, we consider the 
attacks on users 4 and 5 to be successful. The attacks fail only for users 2 and 3 
(red arrows), and the privacy risk is 5/7 × 1/2 = 5/14. 
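
The two risk values in this example can be reproduced with a short sketch. The assignment and the cluster labels below are assumptions chosen to match the description (users 2 and 3 swapped, users 4 and 5 swapped, and users 4 and 5 sharing a cluster); the actual figure may group the remaining users differently.

```python
from fractions import Fraction


def privacy_risk(assignment, truth, cluster_of=None, k=1):
    """Ratio of successful attacks, multiplied by 1/k for k-anonymized data.

    Without clusters an attack succeeds only on an exact match; with
    clusters it succeeds when the assigned record is in the correct cluster.
    """
    if cluster_of is None:
        hits = sum(a == t for a, t in zip(assignment, truth))
    else:
        hits = sum(cluster_of[a] == cluster_of[t]
                   for a, t in zip(assignment, truth))
    return Fraction(hits, len(truth)) * Fraction(1, k)


truth = [0, 1, 2, 3, 4, 5, 6]       # users 1..7, zero-based
assignment = [0, 2, 1, 4, 3, 5, 6]  # users 2<->3 and 4<->5 are swapped

noise_risk = privacy_risk(assignment, truth)

# Hypothetical 2-anonymous clusters in which users 4 and 5 share a cluster.
clusters = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c", 5: "b", 6: "b"}
k_anon_risk = privacy_risk(assignment, truth, clusters, k=2)
print(noise_risk, k_anon_risk)
```

Using exact fractions keeps the results directly comparable with the values 3/7 and 5/14 given in the text.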


4.4.2 Utility Definition 


We define the utility metric here. In previous research, most utility metrics are based 
on either the distance between the original dataset and the anonymized dataset, or the 
amount of information loss [48, 49]. However, the utility depends on the situation 
(i.e., context and use case), and these metrics do not necessarily match the actual 
utility. Therefore, we consider a use case scenario and present a utility definition that 
matches the scenario. Specifically, we consider a use case in which an anonymized 
dataset is used as training data for a machine learning algorithm. In the case of a 
Web access log dataset, for example, a client who develops anti-virus software 
may generate a machine learning model from an anonymized dataset and 
predict whether its users will access a phishing Web site. 


Definition 4.13 (Utility metric) Let $F(M, E)$ be the F-measure of a machine 
learning model, where the training data are $M$ and the test data are $E$. The utility 
metric is defined as follows: 


$$\mathrm{Uti}(A(M)) = \frac{F(A(M), E)}{F(M, E)}$$ 


Figure 4.8 gives an overview of the utility evaluation. We first generate two 
machine learning models: one from an original dataset and the other from 
its anonymized dataset. An item is randomly chosen as the objective variable, and the 
remaining items are the explanatory variables. Then, we use these models to predict an 
attribute of each record of an evaluation dataset that has the same attributes as 
the original dataset. This operation is performed several times while the objective 
variable is changed. The utility is defined as the average of the ratio of the F-measure 
of the model of the anonymized dataset to that of the model of the corresponding 
original dataset. In this paper, we apply logistic regression as the machine learning 
algorithm and predict fifty attributes. 
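
As a small worked example of this ratio, the sketch below recomputes one utility score from values reported later in this chapter: the precision and recall of the first row of Table 4.19 and the F-measure of 0.763 for the model trained on the original dataset. The helper names are illustrative, and the F-measure is assumed to be the usual harmonic mean of precision and recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def utility(f_anonymized, f_original):
    """Ratio of the F-measure of the anonymized model to the original one."""
    return f_anonymized / f_original


def mean_utility(pairs):
    """Average ratio over several objective variables.

    pairs: iterable of (F of anonymized model, F of original model).
    """
    pairs = list(pairs)
    return sum(utility(fa, fo) for fa, fo in pairs) / len(pairs)


f_anon = f_measure(0.686, 0.735)  # first row of Table 4.19
print(round(f_anon, 3), round(utility(f_anon, 0.763), 3))
```

The rounded values agree with the F-measure (0.710) and Uti(D) (0.930) columns of that row.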


4.4.3 Matrix Factorization 


Matrix factorization is a fundamental task in data analysis, and the technique is used 
in various scenarios, such as text data mining, acoustic analysis, and product 
recommendation by collaborative filtering. We use matrix factorization as an anonymization 
technique, so we present an overview of matrix factorization in this section. 

Fig. 4.8 Overview of utility evaluation 


4.4.3.1 SGD Matrix Factorization 


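We consider an unknown rank-$r$ matrix $M \in \mathbb{R}^{n \times m}$ and assume that we know a set 
of elements $\Omega \subset [n] \times [m]$. $P_\Omega(M) \in \mathbb{R}^{n \times m}$ is defined as: 


$$P_\Omega(M)_{ij} = \begin{cases} M_{ij} & \text{if } (i, j) \in \Omega, \\ 0 & \text{otherwise.} \end{cases} \quad (4.2)$$ 


The goal of matrix factorization is to find two matrices $U \in \mathbb{R}^{r \times n}$ and 
$V \in \mathbb{R}^{r \times m}$ that approximate the original matrix, $M_{ij} \approx X_{ij}$ s.t. 
$\forall M_{ij} \in \Omega(M)$, with lower dimensionality $r \ll \min(n, m)$. Here, $X = U^T V$. 

This problem is formulated as the following optimization problem: 


$$\min \sum_{(i, j) \in P_\Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda (\|u_i\|^2 + \|v_j\|^2), \quad (4.3)$$ 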


where $u_i$ is a vector of user factors and $v_j$ is a vector of item factors. When both 
$u_i$ and $v_j$ are variables, this function is not convex, so a global optimum of the 
problem described above cannot be obtained directly. Several techniques have been 
proposed to address this, and gradient descent [50], for example, is a fundamental 
technique for finding a local minimum. However, gradient descent needs to update the 
vectors iteratively to obtain a solution and is computationally expensive, so stochastic 
gradient descent (SGD) is widely used, for example, in the KDD Cup 2011 [51] and 
the Netflix Prize [52]. 

There has been some research on speeding up SGD-based matrix factorization, such 
as [53-56], and each algorithm updates the matrices in parallel or in a distributed 
manner. 

In this paper, we apply a simple SGD technique to optimize Eq. (4.3) and denote by 
Update(A) the update of a matrix A using the SGD technique. 
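
A minimal pure-Python sketch of SGD for the regularized objective in Eq. (4.3) is given below. It is an illustration under assumptions rather than the implementation used in the experiments: `gamma` is assumed to be the learning rate and `lam` the regularization weight, the factor matrices are stored row-wise (row i of `U` is $u_i$ and row j of `V` is $v_j$), and the constant factor 2 of the gradient is absorbed into the learning rate.

```python
import random


def sgd_factorize(M, observed, r, T=100, gamma=0.05, lam=0.01, seed=0):
    """Factorize M over the observed (i, j) pairs with plain SGD."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    # Row i of U holds u_i and row j of V holds v_j (the transposes of the
    # r x n and r x m matrices in the text).
    U = [[rng.random() for _ in range(r)] for _ in range(n)]
    V = [[rng.random() for _ in range(r)] for _ in range(m)]
    pairs = list(observed)
    for _ in range(T):
        rng.shuffle(pairs)
        for i, j in pairs:
            err = M[i][j] - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(r):
                u, v = U[i][f], V[j][f]
                U[i][f] += gamma * (err * v - lam * u)
                V[j][f] += gamma * (err * u - lam * v)
    return U, V


def squared_error(M, observed, U, V):
    """Sum of squared reconstruction errors over the observed pairs."""
    return sum((M[i][j] - sum(a * b for a, b in zip(U[i], V[j]))) ** 2
               for i, j in observed)


M = [[1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
observed = [(i, j) for i in range(3) for j in range(4)]
before = squared_error(M, observed, *sgd_factorize(M, observed, r=3, T=0))
after = squared_error(M, observed, *sgd_factorize(M, observed, r=3, T=100))
print(before, after)
```

Calling the function with `T=0` returns the random initialization, which gives a baseline against which the error after training can be compared.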


4.4.4 Anonymization Using Matrix Factorization 


We consider matrix factorization to be an anonymization method in which the rank $r$ 
controls the accuracy of the matrix approximation. Moreover, we propose 
combining matrix factorization with another anonymization method ano, such as 
k-anonymization or noise addition. We denote by $p$ the parameter of the 
anonymization method; $p$ is $k$ or $\phi$ in this paper. A basis matrix $U$ and a weighting 
matrix $V$ can be assumed to be the characteristics of the rows and columns, 
respectively, and $U$ is a characteristic matrix of users in our dataset. Therefore, we 
propose to anonymize $U$ and maintain $V$ so that the characteristics of the domain are 
preserved. In our algorithm, we first factorize the dataset $M$ into $U$ and $V$ and 
anonymize $U$. Then, we optimize $V$ once and recombine it with the anonymized $U$. The 
algorithm is described below. 

We denote by $A_r(D)$ the application of matrix factorization to matrix $D$ and by 
$A_{(ano, r)}(D)$ the combination of matrix factorization and the anonymization method ano: 


$$A_{(ano, r)}(D) = (A_{(ano)}(U))^T V, \quad \text{where } U \in \mathbb{R}^{r \times n},\ V \in \mathbb{R}^{r \times m}. \quad (4.4)$$ 
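
The recombination in Eq. (4.4) can be sketched as follows, assuming that the factors $U$ (r × n) and $V$ (r × m) have already been obtained. Noise addition is used as the example method ano; the Laplace sampler, the `scale` argument (whose relation to the noise parameter of this chapter is not specified here), and the small matrices are assumptions for illustration only.

```python
import random


def laplace_noise(rng, scale):
    """Laplace(0, scale) sample: the difference of two independent
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_noise(U, scale, seed=0):
    """Example anonymization method: add Laplace noise to every element."""
    rng = random.Random(seed)
    return [[x + laplace_noise(rng, scale) for x in row] for row in U]


def recombine(U, V):
    """X = U^T V for U (r x n) and V (r x m) given as nested lists."""
    r, n, m = len(U), len(U[0]), len(V[0])
    return [[sum(U[f][i] * V[f][j] for f in range(r)) for j in range(m)]
            for i in range(n)]


def anonymize(U, V, ano):
    """Anonymize the user factors only and keep the item factors intact."""
    return recombine(ano(U), V)


U = [[1, 0, 1], [0, 1, 1]]        # r = 2 factors for n = 3 users
V = [[1, 0, 1, 0], [0, 1, 0, 1]]  # r = 2 factors for m = 4 domains
X = anonymize(U, V, lambda u: add_noise(u, scale=0.1))
print(X)
```

Because only `U` is perturbed, the domain-side factors `V` stay shared by all anonymized users, which is the design choice described above.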


Algorithm 4.6 ($M$, $r$, $T$, ano, $p$): Anonymization using Matrix Factorization 
Input: Original dataset $M$, rank $r$, and the number of iterations $T$. 
1: $t = 0$ 
2: Construct $U_t \in [0, 1]^{r \times n}$ and $V_t \in [0, 1]^{r \times m}$ randomly 
3: while $t < T$ do 
4:   $U_{t+1}$ = Update($U_t$) 
5:   $V_{t+1}$ = Update($V_t$) 
6:   $t = t + 1$ 
7: end while 
8: $U' = A_{(ano)}(U_{T+1})$ 
9: return $X = U'^T V_{T+1}$ 


Table 4.8 Dataset format 


ID (= i)          Date                  URL (= j) 
$x_{t_1}$ (= 1)   2016-12-01 16:13:48   www.google.com (= 1) 
$y_{t_1}$ (= 2)   2016-12-01 16:15:14   www.mail.google.com (= 2) 
$x_{t_1}$         2016-12-01 16:17:13   www.youtube.com (= 3) 
$z_{t_1}$ (= 3)   2016-12-01 16:19:01   www.facebook.com (= 4) 
$x_{t_2}$ (= 1)   2016-12-01 16:21:15   www.youtube.com 
$x_{t_2}$         2016-12-01 16:22:42   www.google.com 
$z_{t_2}$ (= 3)   2016-12-01 16:25:01   www.youtube.com 


4.4.5 Experiment 


4.4.5.1 Dataset 


We use an actual Web access log dataset as a time-sequence dataset. The dataset 
consists of an ID, a time stamp, and the access domain, as shown in Table 4.8. We 
convert the dataset into a matrix as follows: 


$$M_T = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{pmatrix} \quad (4.5)$$ 


Here, $T$ is the observation time. We set $r_{ij} = 1$ if the user whose ID is $i$ accesses 
domain $j$ during time $T$; otherwise, $r_{ij} = 0$. For example, we construct the matrices 
for the dataset in Table 4.8 as follows: 


$$M_{t_1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \quad (4.6)$$ 


$$M_{t_2} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \quad (4.7)$$ 


Here, $t_1$ is the 10-min span between 2016-12-01 16:10:00 and 2016-12-01 
16:19:59, and $t_2$ is the following 10-min span between 2016-12-01 16:20:00 and 
2016-12-01 16:29:59. The IDs are different between $t_1$ and $t_2$, but $x_{t_1}$ and $x_{t_2}$, 
and $z_{t_1}$ and $z_{t_2}$, represent the same users. 


Table 4.9 Linkage attack against a non-anonymized dataset 


Observation time (h)   Linkage attack probability 
2                      0.51 
4                      0.64 
8                      0.80 


In the following experiments, we randomly chose 200 users and 1,000 domains 
from an actual Web access log and let the pseudonymous ID be changed at each 
designated time $T$. 
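
The conversion from the log format of Table 4.8 to the matrices of Eqs. (4.6) and (4.7) can be sketched as follows. The function name, the tuple layout of the log, and the use of zero-based row and column indices are assumptions for illustration; the rows follow the user numbering (= i) given in the table.

```python
from datetime import datetime, timedelta


def build_matrices(log, n_users, domains, start, window):
    """Return {window index: n_users x len(domains) binary matrix}."""
    column = {domain: j for j, domain in enumerate(domains)}
    matrices = {}
    for user_row, stamp, domain in log:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        index = (when - start) // window  # which observation window
        matrix = matrices.setdefault(
            index, [[0] * len(domains) for _ in range(n_users)])
        matrix[user_row][column[domain]] = 1
    return matrices


DOMAINS = ["www.google.com", "www.mail.google.com",
           "www.youtube.com", "www.facebook.com"]
# (zero-based user row, time stamp, domain), following Table 4.8.
LOG = [
    (0, "2016-12-01 16:13:48", "www.google.com"),
    (1, "2016-12-01 16:15:14", "www.mail.google.com"),
    (0, "2016-12-01 16:17:13", "www.youtube.com"),
    (2, "2016-12-01 16:19:01", "www.facebook.com"),
    (0, "2016-12-01 16:21:15", "www.youtube.com"),
    (0, "2016-12-01 16:22:42", "www.google.com"),
    (2, "2016-12-01 16:25:01", "www.youtube.com"),
]

matrices = build_matrices(LOG, 3, DOMAINS,
                          start=datetime(2016, 12, 1, 16, 10, 0),
                          window=timedelta(minutes=10))
print(matrices[0])
print(matrices[1])
```

Window 0 corresponds to $t_1$ and window 1 to $t_2$; a user with no access in a window, such as the second user in $t_2$, simply keeps an all-zero row.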


4.4.5.2 The Privacy Risk Against a Linkage Attack 


First, we evaluate whether a linkage attack is possible. We set the observation time 
$t_1$ as 2, 4, and 8 h from 16:00 on a weekday and the observation time $t_2$ as the same 
time on another weekday. The probability of a linkage attack between $M_{t_1}$ and $M_{t_2}$ 
is shown in Table 4.9. 

The matrix only includes information on whether a domain has been accessed, yet 
even if the observation time is 2 h, the linkage attack probability, i.e., the risk, is 
very high (over 50%). Moreover, the risk increases with the observation time because 
the trend of a user becomes more noticeable. The result shows that people's Web 
access patterns have consistent characteristics. Hence, we need to consider not only 
reidentification attacks but also linkage attacks to avoid privacy leakages. 


4.4.5.3 Effects of Matrix Factorization 


Observation times $t_1$ and $t_2$ are fixed as 8 h from 16:00 on a weekday in the following 
experiments. The inputs of matrix factorization are the original dataset $M$, the number 
of iterations $T$, and the rank $r$. Furthermore, $\lambda$ and $\gamma$ are the hyperparameters. We 
fix $T = 100$, which is enough for convergence, $\gamma = 0.05$, and $\lambda = 0.01$. The convergence 
result is shown in Fig. 4.9. The rank $r$ can be treated as the parameter of anonymization 
by matrix factorization because the accuracy of the dataset $X = U^T V$ depends on it; 
we set $r$ = 10, 20, 30, and 40. We set larger values in the experiments in [3], but the 
results for $r > 40$ are saturated. The probabilities of reidentification and linkage 
attacks are shown in Table 4.10. 

The results show that matrix factorization itself does not have much effect on 
reidentification attacks. Note that matrix factorization can preserve the relative 
positional relationship among the records, so the privacy risk of the reidentification 
attack, which uses a matching algorithm, does not decrease much. When the rank is 
small enough, $r = 10$, the positional relationship is broken, and the privacy risk 
decreases slightly (Table 4.10). 

On the other hand, compared with the reidentification attack presented in Table 4.10, 
the linkage attack probability between $A_r(M_{t_1})$ and $A_r(M_{t_2})$ is lower. This is because 
the relationship between the records of $M_{t_1}$ and $M_{t_2}$ is weaker than that between 
$M_{t_1}$ and $A_r(M_{t_1})$. In our experiment, the dataset with an observation time of 8 h and 
$r = 30$ has almost the same privacy level as the non-anonymized dataset with an 
observation time of 2 h (Fig. 4.10). 

Fig. 4.9 Convergence result 


Table 4.10 Attacks against matrix factorization 


Rank   Reidentification attack   Linkage attack 
10     0.98                      0.31 
20     1.00                      0.45 
30     1.00                      0.54 
40     1.00                      0.58 


Fig. 4.10 Overview of the experiment 


Table 4.11 Experiment 1 


k    Reidentification attack   Linkage attack 
2    0.500                     0.185 
4    0.250                     0.050 
6    0.167                     0.038 
8    0.125                     0.027 
10   0.098                     0.023 


4.4.6 Results 


4.4.6.1 Risk Evaluation 


We evaluate our anonymization method, Algorithm 4.6, in the following experiments. 
We apply the method described in [10] as k-anonymization and Laplace noise as the 
noise addition. When noise addition is applied, noise € ~ Lap(0, 27) is added to 
each element, and the parameter is $\phi$. 


1. Evaluate the privacy risk of a reidentification attack between $A_k(M_{t_1})$ and $M_{t_1}$ 
and a linkage attack between $A_k(M_{t_1})$ and $A_k(M_{t_2})$. 

2. Evaluate the privacy risk of a reidentification attack between $A_\phi(M_{t_1})$ and $M_{t_1}$ 
and a linkage attack between $A_\phi(M_{t_1})$ and $A_\phi(M_{t_2})$. 

3. Evaluate the privacy risk of reidentification attacks between $A_k(U_{t_1})^T V$ and $M_{t_1}$ 
and linkage attacks between $A_k(U_{t_1})^T V$ and $A_k(U_{t_2})^T V$. 

4. Evaluate the privacy risk of reidentification attacks between $A_\phi(U_{t_1})^T V$ and $M_{t_1}$ 
and linkage attacks between $A_\phi(U_{t_1})^T V$ and $A_\phi(U_{t_2})^T V$. 


The evaluations of the reidentification attacks in experiments 1 and 2 are almost 
the same as those conducted in many previous studies. The difference is the privacy 
metric (see Sect. 4.4.1), and these results are used for comparison with experiments 3 
and 4, which are evaluations of our algorithm. There are few studies on linkage attacks, 
and evaluations of this type of attack are one of our contributions. 

The evaluation of the reidentification attack in experiment 1 (Table 4.11) is simple, 
and the result is almost the same as for k-anonymization. However, our privacy metric 
is slightly different from that for k-anonymity, so the result is also slightly different 
from 1/k. The result of the linkage attack also shows that k-anonymization can 
greatly improve privacy against linkage attacks and that 2-anonymization can reduce 
the privacy risk by 77% (from 0.80 to 0.185). 

The evaluations of experiment 2 are shown in Table 4.12. The privacy against the 
reidentification attack improves only for $\phi \geq 0.9$, and when $\phi$ is large, for example, 
$\phi = 1.5$, the score appears to be good. However, almost half of the records are 
changed by more than 1 by the added noise, and each original value of $M$ is 0 or 1, 
namely, $M_{ij} \in \{0, 1\}$, so the noise is too large to preserve utility. Therefore, we 
conclude that simple noise addition is not good, in terms of utility preservation, as an 
anonymization method. On the other hand, we obtain an interesting result for linkage 
attacks. The privacy against linkage attacks improves even if the noise is very small, 
so adding even a small amount of noise is an effective countermeasure against a 
linkage attack. 


Table 4.12 Experiment 2 


$\phi$   Reidentification attack   Linkage attack 
0.3      1.00                      0.33 
0.6      1.00                      0.10 
0.9      0.95                      0.01 
1.2      0.81                      0.03 
1.5      0.62                      0.00 


Table 4.13 Experiment 3: reidentification attack 


r = 10   r = 20   r = 30   r = 40 
0.50 
0.25 
0.16 
0.12 
0.08 


Table 4.14 Experiment 3: linkage attack 


r = 10   r = 20   r = 30   r = 40 
0.15 
0.07 
0.04 
0.03 
0.02 

In experiment 3, we evaluate the effect of our proposed algorithm, which is a 
combination of matrix factorization and k-anonymization. Table 4.13 presents the 
result of the reidentification attack. In this experiment, the effect of matrix 
factorization is not clearly visible, although the privacy slightly improves as $r$ 
increases. This is because k-anonymization has a large effect on the reidentification 
risk, so the effect of the matrix factorization does not appear. 

The results of the linkage attack in experiment 3 are shown in Table 4.14. In the 
experiment, we cannot obtain new knowledge about the effect of matrix factorization. 
When the datasets, which are observed at different time periods, are sufficiently 
anonymized by k-anonymization, there is no relationship among the same users of 
each dataset and only outliers can be linked. 


Table 4.15 Experiment 4: reidentification attack 


$\phi$   r = 10   r = 20   r = 30   r = 40 
0.05     0.75     0.95     0.97     1.00 
0.10     0.42     0.72     0.85     0.86 
0.15     0.25     0.50     0.61     0.70 
0.20     0.18     0.28     0.40     0.49 


Table 4.16 Experiment 4: linkage attack 


$\phi$   r = 10   r = 20   r = 30   r = 40 
0.05     0.21     0.34     0.34     0.50 
0.10     0.12     0.15     0.14     0.20 
0.15     0.07     0.11     0.09     0.10 
0.20     0.03     0.03     0.03     0.02 


Fig. 4.11 Reidentification risk of the combination of matrix factorization and noise 
addition (horizontal axis: noise parameter) 

In experiment 4, we evaluate the impact of our method, which is a combination of 
matrix factorization and noise addition. The evaluation results of the reidentification 
attack are presented in Table 4.15. Noise is added to $U$, which represents the users' 
characteristics, and then $U^T$ is multiplied by $V$. Therefore, we cannot simply compare 
the results with those of experiment 2, but the impact of the matrix factorization is 
high. This result shows that using matrix factorization can help to construct 
anonymized datasets flexibly from the viewpoint of privacy. For example, the privacy 
risk of $A_{(\phi=0.15, r=20)}(M_{t_1})$ and $A_{(\phi=0.20, r=40)}(M_{t_1})$ is almost the same as that of 
$A_{(k=2)}(M_{t_1})$ and $A_{(\phi=1.5)}(M_{t_1})$. 

The results of the linkage attack in experiment 4 are presented in Table 4.16. The 
trend is the same as that of the reidentification attack, and the matrix factorization is 
compatible with noise addition. We present the details of the results of the 
reidentification attack and the linkage attack in Figs. 4.11 and 4.12. 


Table 4.17 Utility evaluation 1 


Dataset D               Precision   Recall   F-measure   Uti(D) 
$A_{(k=2)}(M_{t_1})$                                     0.981 
$A_{(k=4)}(M_{t_1})$                                     0.936 
$A_{(k=6)}(M_{t_1})$                                     0.946 
$A_{(k=8)}(M_{t_1})$                                     0.913 
$A_{(k=10)}(M_{t_1})$                                    0.932 


4.4.6.2 Utility Evaluation 


We next evaluate the utility of the anonymized datasets by applying a machine 
learning algorithm. Logistic regression 
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) 
is applied in the following experiment, and the parameters are those of the default 
setting. One of the applications of an access log dataset is to predict a malicious site 
and inform the web browser's users. Therefore, we use a machine learning algorithm 
to predict whether each user will access a malicious site. We generate learning 
models using the original (non-anonymized) dataset and the anonymized datasets 
and input the test dataset to these models. The utility score is defined in Definition 
4.13, and the F-measure of the model of the original dataset is 0.763. The results of 
the evaluations are shown in Tables 4.17, 4.18, 4.19, 4.20, and 4.21. 


1. Evaluate the utility of $A_{(k)}(M_{t_1})$ for $k$ = 2, 4, 6, 8, and 10. 

2. Evaluate the utility of $A_{(\phi)}(M_{t_1})$ for $\phi$ = 0.3, 0.6, 0.9, 1.2, and 1.5. 

3. Evaluate the utility of $A_{(k=2, r)}(M_{t_1})$ for $r$ = 10, 20, 30, and 40. 

4. Evaluate the utility of $A_{(\phi, r)}(M_{t_1})$ for $\phi$ = 0.1 and 0.15 and $r$ = 10, 20, 30, and 
40. 


In experiment 1, each element is $M_{ij} \in \{0, 1\}$ and the matrix is sparse, so 
k-anonymization remains effective. However, when the dataset is more complex, the 
utility of k-anonymization will decrease; this is widely known as the curse of 
dimensionality. 


Table 4.18 Utility evaluation 2 


Dataset D                  Precision   Recall   F-measure   Uti(D) 
$A_{(\phi=0.3)}(M_{t_1})$                                   0.941 
$A_{(\phi=0.6)}(M_{t_1})$                                   0.876 
$A_{(\phi=0.9)}(M_{t_1})$                                   0.810 
$A_{(\phi=1.2)}(M_{t_1})$                                   0.748 
$A_{(\phi=1.5)}(M_{t_1})$                                   0.744 


Table 4.19 Utility evaluation 3 


Dataset D                    Precision   Recall   F-measure   Uti(D) 
$A_{(k=2, r=10)}(M_{t_1})$   0.686       0.735    0.710       0.930 
$A_{(k=2, r=20)}(M_{t_1})$   0.699       0.767    0.731       0.959 
$A_{(k=2, r=30)}(M_{t_1})$   0.695       0.773    0.732       0.960 
$A_{(k=2, r=40)}(M_{t_1})$   0.712       0.786    0.747       0.980 


Table 4.20 Utility evaluation 4 


Dataset D                         Precision   Recall   F-measure   Uti(D) 
$A_{(\phi=0.10, r=10)}(M_{t_1})$   0.742       0.650    0.693       0.909 
$A_{(\phi=0.10, r=20)}(M_{t_1})$   0.752       0.688    0.719       0.943 
$A_{(\phi=0.10, r=30)}(M_{t_1})$   0.736       0.703    0.719       0.943 
$A_{(\phi=0.10, r=40)}(M_{t_1})$   0.737       0.735    0.736       0.965 


Table 4.21 Utility evaluation 5 


Dataset D                         Precision   Recall   F-measure   Uti(D) 
$A_{(\phi=0.15, r=10)}(M_{t_1})$   0.718       0.614    0.662       0.868 
$A_{(\phi=0.15, r=20)}(M_{t_1})$   0.748       0.655    0.698       0.915 
$A_{(\phi=0.15, r=30)}(M_{t_1})$   0.704       0.680    0.692       0.907 
$A_{(\phi=0.15, r=40)}(M_{t_1})$   0.716       0.711    0.713       0.935 


The results of experiment 2 show that the utility of the dataset decreases as the noise 
increases. As stated in the risk evaluation section, each element of the original dataset 
is 0 or 1, and the utility drastically worsens when the noise parameter is large, such 
as $\phi = 1.5$. 

When k-anonymization and matrix factorization are combined, the effect of matrix 
factorization is small, as is the case for the privacy risk. In this experiment, the effect 
of k-anonymization is large, and the effect of matrix factorization is relatively small. 

The evaluation results of the combination of noise addition and matrix 
factorization show a good performance (Tables 4.20 and 4.21). A dataset generated by 
combining matrix factorization and noise addition has more utility than a dataset 
generated by noise addition alone when the two datasets have the same privacy level. 

Fig. 4.13 Anonymization and privacy risk evaluation tool 1 


4.5 Anonymization and Privacy Risk Evaluation Tool 


In this section, we introduce an anonymization and privacy risk evaluation tool. So 
far, we have shown how to evaluate the privacy and utility of several datasets. We 
focus on static datasets and apply the theory we have described in the tool. First, 
we explain the outline of the tool. The tool requires a dataset that is the target of 
anonymization and privacy risk evaluation. At this time, the data type is defined for 
each attribute (see Fig. 4.13). Numerical, qualitative, set, code, and sensitive types 
can be defined. Age, height, and weight are defined as numerical types, and a user 
can assign a range of values. For instance, a user may want to divide age into groups 
of two years or five years depending on the situation. Qualitative-type records have 
nonnumerical value, such as gender and occupation. The set type is an extended 
numerical or qualitative type, and attributes that include multiple data correspond 
to this type. The code type is defined when every value is the same digit, such as a 
postcode. The sensitive type corresponds to sensitive information. The privacy risk 
is evaluated using quasi-identifiers in our tool, and the attributes that are sensitive 
do not effect the privacy risk. However, it is known that sensitive information may 
cause privacy leakages, and the tool can cover the risk for sensitive information such 
as l-diversity. 

After the type of each attribute is decided, a user defines the noise and sampling 
parameters. Our tool can evaluate datasets that are anonymized by the combined 
method. Then, the user generates a hierarchical tree for each attribute, and the tool 
anonymizes the values in accordance with the tree. The user can generate and change 
the construction of hierarchical trees by using a UI (see Fig. 4.14).

Fig. 4.16 Anonymization and privacy risk evaluation tool 4

After these preparations are finished, the user can define conditions and generate
a dataset flexibly. A sample operation screen is shown in Fig. 4.15. Let us
introduce a commonly used method as an example. First, the user searches for records that
do not achieve k-anonymity, namely, records for which fewer than k copies of the
same record exist. The user then changes the level of an attribute of those records
in the hierarchical tree. Records that are already secure enough are not processed, so the
utility of the dataset can be maintained. The conditions can be more complex. For
example, the records that have a value of “age” over 80 and a value of “occupation”
that is not “self-employed” can be identified and anonymized. The ranks of the records
are “balanced” according to the hierarchical tree. The privacy risk can be seen in
real time (Fig. 4.16), and the user can anonymize a dataset by trial and error. The
operation procedure can be output as a setting file, and once the operation is decided,
the procedure can be performed automatically, for example in batch processing.
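
The search-and-generalize step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation of the tool: the attribute names (age, gender), the band widths of the age hierarchy, the function names, and the sample records are all hypothetical.

```python
from collections import Counter

def generalize_age(age, level):
    """Hypothetical hierarchy: level 0 = exact age, 1 = 5-year band, 2 = 10-year band."""
    if level == 0:
        return str(age)
    width = 5 if level == 1 else 10
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def violating_records(records, levels, k):
    """Indices of records whose quasi-identifier values occur fewer than k times."""
    keys = [(generalize_age(r["age"], lvl), r["gender"])
            for r, lvl in zip(records, levels)]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]

def anonymize(records, k, max_level=2):
    """Raise the age level only for records that do not yet achieve k-anonymity."""
    levels = [0] * len(records)
    for _ in range(max_level):
        for i in violating_records(records, levels, k):
            levels[i] += 1
    return levels, violating_records(records, levels, k)

# Illustrative records, made up for this sketch.
records = [
    {"age": 31, "gender": "F"},
    {"age": 31, "gender": "F"},
    {"age": 42, "gender": "M"},
    {"age": 44, "gender": "M"},
    {"age": 67, "gender": "F"},
]
levels, still_at_risk = anonymize(records, k=2)
```

In this sketch the first two records already satisfy k = 2 and stay at level 0, the third and fourth records become indistinguishable after one generalization step, and the last record remains unique even at the coarsest level, so it is reported as still at risk.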


4.6 Conclusion 


In this chapter, we considered the importance of data and privacy. Several anonymization
techniques, including k-anonymization, were introduced in Sect. 4.2, and the privacy
and adversary model for static data was shown in Sect. 4.3. We focused on static
data and time-sequence data in this project, and we discussed time-sequence data in
Sect. 4.4. Finally, in Sect. 4.5, we introduced an anonymization and privacy risk
evaluation tool. The tool was partly developed in this project, and we are actively
pursuing its commercial use.


Chapter 5
Living Safety Testbed Group


Koji Kitamura and Yoshifumi Nishida 


Abstract Safety technology for everyday activities is strongly needed for children, 
the elderly, and persons with disabilities. However, it is difficult to understand prob- 
lems related to everyday life from injury data, medical data, and so on because 
such data are distributed over multiple organizations and cannot easily be shared or 
integrated due to privacy protection concerns. To address this issue, our project is 
developing technologies for integrating and utilizing multi-organizational distributed 
big data based on security technology. The authors research school safety based on 
the developed technologies. In this chapter, the authors describe a trend analysis
technology for time series injury data, a cliff analysis technology for extracting serious
injury situations, and a child behavior prediction technology as the functions needed
for finding and predicting serious injuries and evaluating the effectiveness of an
intervention. We also present some analysis examples using the developed functions.
Furthermore, we describe some social implementation projects for preventing
the serious injuries found by analyzing injury data with our developed system.


5.1 Necessity of Living Safety 


Community safety is highly desirable for children, the elderly, persons with disabilities,
and others who need functional support in daily life. People whose daily-life
functions change can experience insufficient bodily or cognitive
function under conditions or in environments that had previously been problem-free.
Risk then arises at certain times, and maintaining their safety through their own care
or the care of people around them becomes difficult. It is accordingly important
to seek out data that will serve as a basis for identifying states of risk and related
conditions, to implement effective corrective measures, and to verify the results.

In the realm of community safety, historical data on past accidents and therapies
are commonly dispersed among many different organizations, and
it is therefore difficult to determine the total number of accidents that have occurred
and to gain an overall perspective extending from the cause of an accident to the
resulting injury. If relevant data held at many different organizations can be integrated
and utilized, this may lead to data-based problem identification and effective solutions.

In practice, sharing and integrating data across institutions is difficult because
of the need to protect information on individuals, maintain privacy, and prevent
information leakage, among other needs. So long as declining to actively share and
integrate such data carries no blame, the advancement of community safety will tend
to be discouraged. In this light, we are developing technology for utilizing
multi-organizational dispersed data using security-based technology, in a Japan Science
and Technology Agency (JST) CREST project (systemization of the security base
technology for accelerating big data integration and utilization). The research group
of the authors is working in collaboration with data-holding medical/therapy
organizations and with product design and other data-user sites to develop technology
for effective utilization of organizationally dispersed data. To date, in collaboration
with the Fire and Disaster Management Agency, the Japan Sport Council, multiple
medical institutions, nursery, elementary, and junior high schools, and other entities,
we have advanced the development of technology for integrating and utilizing dispersed
injury-related data.

With school safety as a specific field of application, we are demonstrating the
concept and the system. So far, we have compiled big data carrying medical cost
and other KPIs from accident data dispersed among multiple elementary schools
and integrated them on a presumptive basis without identifying the schools. We have
also conceived and developed a serious injury accident analysis system that uses the
multi-party private set intersection (PSI) protocol, a privacy-preserving
information-sharing technique, together with severity cliff analysis technology to
analyze the main accidents causing severe injury, and we have verified the effectiveness
of the system by applying it to actual data. With this system, the analyst is expected
to identify the tasks to be addressed at the school site and have them applied as
preventive measures.

In the present report, we describe the extension of the proposed system with functions
that become necessary in on-site use, namely task identification focused on temporal
changes and evaluation of intervention results, and their application to actual data.
We also report on actual utilization of the system and on the tasks identified as we
engaged in acquiring and analyzing the fine-grained data necessary for injury prevention.


5.2 Overview of Test Bed System for Living Safety


For problem identification and solution, a privacy-preserving system is necessary
to permit sharing and integration of data held by multiple organizations. An
analytical method for obtaining useful information from the shared
and integrated data is also required. In the JST CREST project in which the authors
participate (systemization of the security base technology for accelerating big
data integration and utilization), we have developed a privacy-preserving set
computation technology (PSI: private set intersection) and have proposed a system
that includes the severity cliff analysis technology.

PSI technology enables extraction of intersections in relation to specified data 
items left uncoded and held by multiple organizations. With its utilization, accident 
information meeting conditions specified by the user can be provided to the user 
in an integrated state while leaving concealed the identity of the school where the 
accident occurred. 
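
To illustrate the idea of computing an intersection without revealing the underlying items, the following is a toy Python sketch. The chapter does not specify the protocol at this point, so the sketch swaps in a textbook two-party Diffie-Hellman-style construction, not the multi-party protocol of the project. The modulus, the function names, and the item strings are assumptions chosen for illustration, and the parameters are far too small for real use.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # toy modulus (a Mersenne prime); not a secure parameter choice

def hash_to_group(item):
    """Map a string to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def new_secret():
    """Random exponent coprime to P - 1, so that exponentiation is a bijection."""
    while True:
        s = secrets.randbelow(P - 3) + 2
        if math.gcd(s, P - 1) == 1:
            return s

def blind(values, secret):
    """Raise every value to the secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]

def intersect(items_a, items_b):
    """Simulate both parties locally; only blinded values would be exchanged."""
    a, b = new_secret(), new_secret()
    a_once = blind([hash_to_group(x) for x in items_a], a)  # A sends to B
    b_once = blind([hash_to_group(y) for y in items_b], b)  # B sends to A
    a_twice = blind(a_once, b)       # B returns these to A in the same order
    b_twice = set(blind(b_once, a))  # A computes these itself
    return [x for x, v in zip(items_a, a_twice) if v in b_twice]
```

Because exponentiation commutes, an item held by both parties yields the same doubly blinded value on both sides, while neither party sees the other's unblinded items.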

The severity cliff analysis technology provides a means of analyzing the cause 
of severe injury accidents by seeing medical cost as severity. It enables analysis of 
the severity of accidents occurring in similar circumstances, location of the point of 
departure between cases of high and low severity, and differences between accidents 
with severe and slight injury, thus enabling causal analysis of accidents involving 
severe injury. 

Fig. 5.1 System for sharing and analyzing life-safety-related data with secure function

In combination, these two technologies can be used to integrate information
on accidents in multiple school environments while preserving privacy, identify
severe injury accidents from the integrated accident data, and analyze their causes.
More specifically, we have conceived and developed a system as shown in Fig. 5.1.
Accident-related information (e.g., grade, sex, and accident and injury categories) 
desired by the user is entered and criteria-meeting injury data from multiple schools 
are acquired and integrated. Severity cliff analysis is then applied to the accident 
circumstances described by textual data accompanying the acquired injury data, thus 
enabling determination of the severe injury accidents for the specified accident cir- 
cumstances and analysis of the cause. 


5.3 Severity Cliff Analysis of School Injury 


5.3.1 Development of Severity Cliff Analysis System 


5.3.1.1 System Overview 


As shown schematically in Fig. 5.2, the developed severity cliff analysis system
comprises four functions: accident circumstance registration, similar accident cir- 
cumstance search, severe injury accident search, and severity cliff analysis. These 
functions are described in detail in the following corresponding subsections. 


Fig. 5.2 System configuration for cliff analysis


5.3.1.2 Accident Circumstance Registration 


The accident circumstance registration function assigns the accident circumstance 
feature values to the accident circumstances present in the accident database. The 
accident database is first subjected to morphological analysis of text representing 
accident circumstances in order to extract the nouns and verbs. In this analysis, the 
Japanese concept dictionary (Japanese WordNet) is used to consolidate the noun 
and verb orthographic variants. Important words are next extracted with TF-IDF 
weighting of each. In the present study, words with high TF-IDF values were selected 
as representing accident circumstance feature values. These accident circumstance 
feature values are assigned to the accident samples in order to construct the accident 
database with assigned feature values. 
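
The TF-IDF weighting step can be sketched as follows. This is a minimal Python sketch, assuming the accident texts have already been tokenized into nouns and verbs (the morphological analysis and the WordNet-based consolidation are not reproduced here). The function names and the particular TF-IDF variant, relative term frequency times the logarithm of inverse document frequency, are assumptions and not necessarily those used in the developed system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def top_terms(docs, n):
    """Select the n terms with the highest TF-IDF weight in any document."""
    best = {}
    for doc_weights in tf_idf(docs):
        for term, value in doc_weights.items():
            best[term] = max(best.get(term, 0.0), value)
    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:n]
```

A word that appears in every accident description, such as “fall” in a corpus of falls, receives a weight of zero under this variant, so only words that discriminate between circumstances are kept as feature values.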


5.3.1.3 Similar Accident Circumstance Search 


With this second function, the accident circumstances registered by the first func- 
tion for their assigned feature values are sorted into similar accident circumstance 
groups. Clustering is performed using the Euclidean distance of the accident circum- 
stance feature value vectors assigned in the accident database. The optimum cluster 
number is determined with the gap statistic value resulting from the cluster number 
assessment. Figure 5.3 shows the results of sorting the accident database into similar 
accident circumstances. 


Fig. 5.3 Clustering of injury cases

Fig. 5.4 Example of severe injury analysis of injuries occurring under similar situations


5.3.1.4 Severe Injury Accident Search 


The medical costs included in the accident database were used to identify severe 
injury accidents, with medical cost presumed high for severe injury accidents. 
Figure 5.4 shows medical cost in decreasing order for injuries occurring under similar 
circumstances. As shown, medical cost may differ substantially even for accidents 
occurring in similar circumstances, and cliffs marked by specific changes may exist. 
This indicates that severe injury accidents can be identified by focusing on specific 
differences in medical cost. 
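
One simple way to locate such a cliff is sketched below in Python. The chapter does not state how a cliff is detected, so taking the largest drop between neighbouring costs in decreasing order is an assumption made for illustration, as is the function name. The costs are those of the eight soccer and tripping cases reported in Sect. 5.3.2.

```python
def find_cliff(costs):
    """Return (ordered, split, drop): ordered[:split] are the cases above the
    largest drop between neighbouring costs sorted in decreasing order."""
    ordered = sorted(costs, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two cases")
    drops = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return ordered, i + 1, drops[i]

# Medical costs in yen of the cases listed in Sect. 5.3.2.
costs = [3452, 174504, 2984, 154475, 3152, 147297, 2476, 2256]
ordered, split, drop = find_cliff(costs)
severe = ordered[:split]
```

For these costs, the largest drop separates the three soccer cases from the five tripping cases.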


5.3.1.5 Severity Cliff Analysis 


Figure 5.5 shows the relation between degree of circumstance similarity and medical 
cost in similar states of accident, where the degree of circumstance similarity is the 
degree of cosine similarity in comparison with the highest medical cost accident 
cases (severe injury accident cases). Figure 5.6 shows the three-dimensional graph 
obtained on addition of frequency to the graph. Similarity 1.0 denotes the highest 
similarity. With these graphs, comparison of severe injury and slight injury accidents 
under similar circumstances enables performance of severity cliff analysis focused 
on the difference between severe injury and slight injury accidents. 
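
The similarity axis of these graphs can be sketched as follows. This is a minimal Python sketch, assuming each case is represented by a numeric feature vector and its medical cost; the function names and the data layout are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_to_most_severe(cases):
    """cases: list of (feature_vector, medical_cost) pairs.
    Returns (similarity, cost) pairs relative to the highest-cost case,
    sorted by decreasing similarity."""
    reference, _ = max(cases, key=lambda case: case[1])
    pairs = [(cosine(vector, reference), cost) for vector, cost in cases]
    return sorted(pairs, key=lambda pair: -pair[0])
```

Plotting the returned pairs with similarity on one axis and cost on the other gives the kind of view from which severe and slight cases under similar circumstances can be compared.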

Fig. 5.5 Relationship between similarity and cost


5.3.2 Severity Cliff Analysis


To test the effectiveness of the developed method when applied to investigating the 
causes of actual severe injury accidents, we used the accident data of 19,948 cases 
from the Injury and Accident Mutual Aid Benefit System for multiple junior high 
schools gathered by the Japan Sport Council. 

We performed the cliff analysis for similar accident circumstances. The relation
between degree of similarity and medical cost is shown in Fig. 5.7. Figure 5.8
shows the graph of Fig. 5.7 with frequency added.

The severe injury accidents in the similarity range of 1.0–0.6 in Fig. 5.8 were as
follows:


• Strongly impacted and injured right shoulder in fall on contact with opponent
during soccer match. (first-year junior high school, bone fracture, ¥174,504)

• In competing for ball with opponent, on contact with that opponent fell from the
left side, impacting with the ground and injuring the left clavicle. (third-year junior
high school, bone fracture, ¥154,475)

• In competing for the ball with an opponent, encountered strong contact and fell
over from the right shoulder, thereby strongly impacting the right shoulder on the
ground and fracturing the right clavicle. (third-year junior high school, ¥147,297)


Fig. 5.7 Relationship between similarity and cost

Fig. 5.8 Relationships

In the same similarity range, the slight injury accidents were as follows:


• At afternoon homeroom starting time, in carrying a bag from a locker and returning
to a seat, the student tripped over the extended leg of a nearby student and fell,
impacting his/her jaw on the leg of a desk and injuring a finger on the left hand.
(second-year junior high school, contusion/bruise, ¥3,452)

• In recess from third class hour, while walking and conversing with a friend, tripped
and fell at entrance to classroom with hands in pockets and therefore impacting
jaw on floor. (second-year junior high school, dislocation, ¥3,152)

• In noon recess, while walking in a corridor, tripped on a friend’s leg and fell,
impacting right eye on wall. (first-year junior high school, contusion/bruise, ¥2,984)

• In classroom before start of class, collided with a friend and fell, impacting face
on floor. (first-year junior high school, bone fracture, ¥2,476)

• While cleaning, engaged in shoving match with friend and fell with left elbow
impacting floor. (third-year junior high school, contusion/bruise, ¥2,256)


In summary, the severe injury accidents all occurred during soccer matches, in
contact with an opponent followed by a fall: contacting an opponent and falling,
competing with an opponent for the ball and falling on contact, and encountering
strong contact while competing for the ball and falling over. The slight injuries,
in contrast, all involved tripping on or colliding with something or someone and
falling: tripping over someone’s leg, tripping at an entrance, tripping on a
friend’s leg, colliding with a friend, and engaging in a shoving match. Taken
together, the results show that among similar instances of tripping and falling,
severe injuries occur more readily when colliding with an opponent and falling in
a soccer match.

Let us next consider the severe injury accidents in the similarity range of 1.0–0.7
in the clusters shown in Figs. 5.9 and 5.10, which were as follows.


• In a “soft tennis” club morning practice session, a ball came flying unseen and
unevaded, directly striking a student in the right eye. (first-year junior high school,
contusion/bruise, ¥48,740)

• In a “soft tennis” club morning practice session, while throwing ball on the tennis
court for a two-step hit, a student was struck in the left eye by a ball hit by an
opponent, suffering a bruised left eyeball, left retinal tear, and left eye conjunctivitis.
(first-year junior high school, contusion/bruise, ¥42,416)

• In a softball club activity, a student playing catch with a third-year student on the
school ground was struck in the face by the ball after losing sight of it and having
it hit his/her own glove. (second-year junior high school, bone fracture, ¥18,112)


In the same similarity range, the slight injuries were as follows: 


• While a student was playing handball in a physical education class on the schoolyard
in the fourth school period, during the match, the ball came flying toward
the student, who tried to catch it but mistakenly was struck by the ball on the ring
finger of the left hand. (second-year junior high school, bone fracture, ¥6,304)

• In a volleyball activity, when boys and girls were practicing hitting serves over
the net, the ball hit by a boy struck a student on the left thumb, breaking a bone.
(second-year junior high school, bone fracture, ¥4,484)

• During a volleyball club tournament in a seaside region, in a practice serve at a
gymnasium before a match, a hit ball from the opposite side of the court struck
and injured the right hand of the student. (third-year junior high school,
contusion/bruise, ¥3,492)

• During dribbling practice in a club activity at a gymnasium, a ball bounced off
the leg of a club member and struck the right hand and sprained the thumb of the
student. (second-year junior high school, sprain, ¥2,164)

• During bunting practice of the baseball club after class, in a bunting attempt, the
bat was mispositioned and the ball struck the thumb of the right hand. (third-year
junior high school, sprain, ¥2,032)


In summary, the severe injury accidents involved a tennis ball that came flying and
struck the right eye, a tennis ball hit by an opponent that struck the eye, and a
softball striking the eye, and thus all involved an eye being struck by a ball. The
slight injury accidents involved a handball striking a finger, a hit volleyball striking
a thumb, a basketball striking the right hand, and a baseball striking the right thumb,
and thus all involved a ball striking a hand or leg. These findings show
that, among accident circumstances in which a ball strikes the body, those in
which the ball strikes an eye tend to result in severe injury. This in turn indicates
that there are certain parts of the body and types of sports for which injuries tend to
be serious and for which a preventive measure such as an eye protector is necessary
but seldom implemented.


5.4 Trend Analysis of School Injury


5.4.1 Trend Analysis for Evaluating Intervention 


Annual trends can provide an effective perspective in the search for problems that 
need to be solved. Examples include accidents that have sharply increased in recent 
years and cases that have been large in number with no change over many years, 
which may represent problems requiring consideration of preventive measures. It is 
also important to focus on annual trends when assessing the effects of measures or 
interventions. In this light, we have developed a trend analysis function that can be
integrated with the previously developed severe injury
accident analysis system. It has thus become possible to analyze changes in trends
with a focus on the circumstances and on the words characteristic of accident occurrence.
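
A per-year count of cases containing a characteristic word can be sketched as follows. This is a minimal Python sketch, assuming each case is a (year, token list) pair; the function names are hypothetical, and the records are toy data invented for illustration, not results from the accident database.

```python
from collections import Counter

def yearly_trend(records, terms):
    """records: iterable of (year, tokens). Returns {term: Counter({year: n_cases})},
    counting the cases per year whose description contains the term."""
    trend = {term: Counter() for term in terms}
    for year, tokens in records:
        present = set(tokens)
        for term in terms:
            if term in present:
                trend[term][year] += 1
    return trend

def yearly_change(counts):
    """Differences between consecutive recorded years, as {year: count - previous}."""
    years = sorted(counts)
    return {y: counts[y] - counts[prev] for prev, y in zip(years, years[1:])}

# Toy records, invented for illustration only.
records = [
    (2012, ["fall", "stairs"]),
    (2012, ["fall"]),
    (2013, ["fall", "ball"]),
    (2013, ["ball"]),
    (2014, ["ball"]),
]
trend = yearly_trend(records, ["fall", "ball"])
```

Note that a year in which a term never occurs is simply absent from its counter in this sketch, so a complete time series would need the missing years filled with zero before plotting.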


5.4.2 Analysis of Judo Accident 


We have applied this trend analysis function to analyze data on 60,300 senior high
school cases among 152,695 cases of judo-related injury included in the Injury and
Accident Mutual Aid Benefit System data of the Japan Sport Council from 2008 to
2015.

Fig. 5.11 Analysis of judo accident trends relative to judo techniques

Fig. 5.12 Analysis of trends in injuries in judo accidents

Figure 5.11 shows the results of an analysis of trends in judo techniques as related 
to accidents, and Fig. 5.12 shows the results of an analysis of trends in injuries due to 
judo accidents. A publication on judo accidents was issued in 2013, leading to their 
recognition as a social problem, issuance of a related alert, and notification of the 
risks of shoulder throwing and major outer reaping in particular. A manual on safe 
teaching methods was also produced and on-site initiatives were implemented. All 
of these apparently had considerable effect. 

The authors confirmed a marked decrease from 2013 in accidents related to shoulder
throwing, but found no clear reduction in accidents related to major outer reaping.
With this trend analysis, it is thus possible to assess the
effects of intervention, which is important for injury prevention. Application of the
analysis to moderate injuries showed sharp reductions in contusion and bruise, sprain,
and bone fracture, whereas ligament injury and rupture increased sharply in 2011 and
remained at high levels thereafter. It has thus been found that the occurrence of
some injuries that had sharply increased can be sharply reduced
by investigating their causes and then taking actions such as preventive
interventions.


5.5 Childhood Home-Injury Simulation


5.5.1 Background of Simulation 


Most accidents involving children below the age of five occur within their homes. 
Since it is important to maintain a safe home environment for children, it is imperative 
to be able to predict what kinds of accidents may occur in a particular environment 
and then to find ways to improve that environment. However, the various and scattered 
statistical data sources and scientific knowledge related to accident prediction have 
not been structured for integrative utilization. In this section, the authors report on 
the development of a new simulation technology that can be used to predict the 
kinds of accidents that may occur in a particular environment by means of a hybrid 
memory- and model-based approach. The system consists of a graph-structuralized
accident database created from large-scale accident data (which enables the
memory-based approach) and a development behavior model that describes the statistical
relationship between children’s body interaction abilities and their age.


5.5.2 Home-Injury-Situation Simulation System 


In this study, in order to predict child-related accident situations which may occur 
in an individual environment, we propose a home-injury-situation simulation sys- 
tem which consists of three functions: a development-related behavior prediction 
function, an accident situation search function, and a function for classifying products
involving similar risks. The configuration of the proposed system is shown in
Fig. 5.13.

The development-related behavior prediction function is used to estimate the area 
that can be reached by a child’s hands and then visualize that area in 3D space on a 
computer. 

The accident situation search function is used to look for specific accident situations
that involve a product, extracted from accident situation structure data. These
accident situation structure data describe time series changes of the accident
situation in a graph-structuralized form, obtained by utilizing a text mining technique.

The similar-risk-product classification function uses a clustering method to iden- 
tify products that involve similar risks. In the clustering, shape features and the 
accident types are used as feature vectors. 

With these functions, when a user inputs the target environment and child age
information, the system calculates possible interactions such as “grasping object” using
the developmental behavior model by considering the range of products that exist
in the target environment. The system also locates accident data related to such
products using the graph-structuralized accident database and then outputs possible
accidents corresponding to the target child’s development stage. In addition, the
system attempts to determine the potential product risks using the third function even
if there are very few or no past reports of accidents involving the products in the
target environment. This case-based prediction facilitates accident forecasting even
if knowledge of interactions between products and children is insufficient.

Fig. 5.13 System configuration of home-injury-situation simulation

In this study, we select accidental ingestion and burn/scald injury as concrete 
example injuries in order to confirm the effectiveness of the system. 


5.5.3 Development Behavior Prediction Function 


Since child behavior changes significantly as development progresses, it is neces- 
sary to consider developmental stages when predicting child-related accidents. The 
development-related behavior prediction function visualizes the behavior of children 
in a virtually constructed environment using the development behavior and semantic 
3D models described below. 

Touch and climbing behaviors are among the primary causes of accidental inges- 
tion and burn/scald injuries. One example reads, “When an electric cooking plate 
was being used on the table, a boy climbed onto a chair and touched the edge of 
the plate, thus burning his finger.” This example shows that even if an object is not 
placed on a floor, it can burn a child if he or she is capable of climbing. Therefore, in 
the current system, we implemented a function for predicting climbing and reaching 
behaviors based on body measurement and behavior characteristics collected from 
more than 2,000 Japanese children. 

The statistical data from this database were published as a book for product
designers in 2013. Using this database, we created a behavior model that describes
the probabilistic relation between the height of a pedestal that a child can climb to
and the reachable horizontal distance from the edge of a pedestal. This model allows
the system to calculate the probability that the child might touch an object placed at
one of a variety of heights. Figure 5.14 shows the relationship between the reachable
horizontal distance from the edge of a pedestal and the pedestal height.

Fig. 5.14 Statistical data on reachable area

When a user inputs information on a target environment, such as a furniture
arrangement, as shown in Fig. 5.15, the system can predict the range of child behavior
that can occur within the target environment. The user inputs environmental
information by constructing and arranging 3D object models in a virtual environ- 
ment. The system utilizes the 3D game engine Unity to achieve a function suitable 
for constructing a target 3D environment on a computer. Each 3D object model has 
semantic information such as the object name and child-related interaction behavior. 
Figures 5.15 and 5.16 show visualization examples. 

Figure 5.15 shows that the child can touch the yellow objects and that whether an
object is touchable depends on the pedestal height, the horizontal distance from the
edge, and the child’s age. For example, although two-year-old children cannot touch an
object placed at a height of 800 mm, four-year-old children can touch an object placed at
a height of 800 mm and a distance of 100 mm from the edge. Figure 5.16 shows that,
depending on age, the child can climb to the red top faces.


5.5.4 Accident Situation Search Function 


Conventional accident data contain detailed information in a free descriptive sentence
format. However, it is difficult to utilize free descriptive data for situation predictions.
Recently, our research group has been developing a graph-structuralization-based
data mining technique [1] to provide a useful tool for obtaining knowledge on causal
relationships arising from interactions between objects and human beings. The
graph-structuralization-based technique allows data mining by first converting the free
descriptive sentences into graph-structured data and then applying a graph analysis
method to the data. Using our software, a user can transform free descriptive data
into graph-structured data that express time series changes in the relationships among
agents such as a child and a parent, a product, and interaction behavior with the
product. We have also collected over 30,000 childhood injury case reports in
cooperation with hospitals, from which we created an accident situation structure
database consisting of data on 681 burn/scald accidents and 1,221 accidental
ingestion incidents. The accident situation search function can be used to find possible
accidents in the accident situation structure database by taking into consideration
both the child’s behavior development stage and the past accident data.

Fig. 5.15 Visualization of reachable objects

Fig. 5.16 Visualization of climbable places


5.5.5 Similar-Risk-Product Classification Function 


Objects that cause similar accidents often show shape and characteristic resem- 
blances. For example, objects related to hot water, such as electric kettles and electric 
pots, can cause burn injuries. Therefore, classification of products from the viewpoint 
of product characteristics is important for predicting potential risks from products. 
Such risk predictions allow us to find potential risks even if a new product has not 
been responsible for any previous injuries. To implement the similar-risk-product 
classification function, the authors conduct hierarchical clustering using the features 
of the objects. 
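
Hierarchical clustering of products by their features can be sketched as follows. This is a minimal Python sketch: the chapter does not state the linkage criterion or the exact features, so single linkage with Euclidean distance, the function name, and any feature vectors passed in are assumptions for illustration.

```python
import math

def single_linkage(features, n_clusters):
    """features: {product_name: feature_vector}. Repeatedly merge the two closest
    clusters (single linkage, Euclidean distance) until n_clusters remain.
    Returns a list of clusters, each a sorted list of product names."""
    clusters = [[name] for name in sorted(features)]

    def distance(c1, c2):
        return min(math.dist(features[a], features[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters
```

With hypothetical features such as (is a container, holds hot liquid, small enough to swallow), an electric kettle and an electric pot fall into one cluster and small swallowable objects into another, so a product with no accident history can inherit the risks of its cluster.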


5.5.6 Simulation Example of the Accident Situation 


Figure 5.17 shows examples of behavior visualization in a target 3D environment. 
Each simulation was performed using the functions stated above. By visualizing a 
child’s behavior by age, it is possible to check how child behavior changes in the input
environment. For example, in Fig. 5.17, although neither a desk nor a chair can be
reached when a child is less than 1 year old, they can both be reached when the
child is more than 2 years old.

Figure 5.18 shows an example of similar objects found when an accident situation 
is input. In this example, the system simulated not only accidents related to tobacco 
and soup, which exist in the environment, but also those resulting from objects 
similar to soup, such as boiling water, tea, and heated baby food. 

Fig. 5.18 Search for potential risks from objects having similar features 

Fig. 5.19 Search for potential risks from objects having similar features 


5.5.7 System Verification 


To demonstrate the validity of the developed simulation, we reproduced actual ordi- 
nary home environments in which accidents had occurred and compared the incident 
reports with the simulation results predicted by the system. Actual injury data and 
environmental information were collected during home visit investigations. To date, 
we have collected such data from 21 ordinary homes where children were injured. At 
this stage of the evaluation, we selected four environments where burn/scald injuries 
occurred and one where an accidental ingestion occurred. 

The evaluation proceeded as follows. First, we input environmental information, 
such as the house layout and furniture placement, together with the accident 
situation, and ran an injury prediction simulation. Figure 5.19 shows the simulated 
home floor plans and the 3D environmental models created using the information 
provided in the investigation. 

Table 5.1 compares actual data with simulated results. In Table 5.1, the “Product” 
column indicates the type of product related to an accident. “Age in accident 
data” indicates the age of the child when the accident occurred. “Minimum age” 
indicates the minimum age in the simulation at which a child could touch a product 
that could cause a burn/scald and/or accidental ingestion. “Number of accident cases” 
indicates the number of accident cases, and the number in parentheses indicates 
the number of accidents due to similar products found by the similar-risk-product 
classification function. The minimum age in the simulation is always less than the 
age given in the accident data. This suggests that the minimum age set by the 
simulation was appropriate. 

Table 5.1 Comparison between actual data and simulation result 

Product             Age in accident data   Minimum age in simulated   Number of 
                    (months old)           results (months old)       accident cases 
Internal medicine   17                     12                         10 
Pot                 57                     36                         25 
Detergent           17                     12                         21 
Stove               50                     12                         5 
Fan heater          11                     0                          0 (8) 
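
The statement that the simulated minimum age is always below the age in the accident data can be checked directly against the rows of Table 5.1. The short sketch below does only that; the function names are ours and are not part of the authors' system.

```python
# Rows of Table 5.1: (product, age in accident data, minimum simulated age), in months
TABLE_5_1 = [
    ("Internal medicine", 17, 12),
    ("Pot", 57, 36),
    ("Detergent", 17, 12),
    ("Stove", 50, 12),
    ("Fan heater", 11, 0),
]


def simulation_is_conservative(rows):
    """True if every simulated minimum age is below the observed accident age."""
    return all(min_age < accident_age for _, accident_age, min_age in rows)


def age_margins(rows):
    """Months between the simulated minimum age and the observed accident age."""
    return {product: accident_age - min_age for product, accident_age, min_age in rows}


print(simulation_is_conservative(TABLE_5_1))
print(age_margins(TABLE_5_1))
```

A positive margin in every row means the simulation flagged each product as reachable before the age at which the real accident occurred.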

It should also be noted that the simulation succeeded in finding 13 of the 14 
accident cases that actually occurred in the environments used for verification in 
this study. This confirms that the developed simulation works for finding various 
accident types. The single incident that the simulation failed to identify involved a 
child who, while being held by a parent, grasped an electric pot placed at a height. 
Since this incident relates more to the parent’s behavior than to the child’s, we 
believe that the simulation is capable of replicating all incidents that a child might 
cause by himself or herself. 


5.6 Social Impact Engagement Based on Big Data Analysis 
in Cooperation with Multiple Stakeholders 


5.6.1 Engagement for Preventing Soccer Goal Turnover 


The developed system was applied to the Injury and Accident Mutual Aid Benefit 
System data compiled by the Japan Sport Council to analyze 1,921 cases of injury 
involving soccer goals that occurred at elementary, junior high, and senior high 
schools in AY2014. The accident circumstances included colliding with a soccer goal; 
tripping on a soccer goal or net and falling; transporting, installing, cleaning, 
hanging from, or jumping into a soccer goal; a soccer goal being overturned by wind; 
falling while climbing or sitting on a soccer goal; and being injured by tools or 
weights used to secure a soccer goal. Some of these accidents were fatal. In our 
analysis of accidents involving soccer goal overturn, we found 29 cases [2]. More 
specifically, the circumstances included the following: 

• A student acting as goalkeeper on the school grounds in a soccer match during a 
  physical education class was overjoyed when a shot flew wide of the goal frame, 
  hung from the goal, fell, became pinned under the goal, and had one 
  or more teeth knocked out by the goal. 

• While playing soccer in a tournament, the goalkeeper was struck in the neck by a 
  goal tipped over by strong wind. 

• During the lunch-hour break, a student was playing tag at the edge of the sports 
  ground when several other children pulled on the net of a mini-soccer goal, which 
  fell over and happened to hit the student, who was passing by, on the right side of 
  his/her face, bruising the student’s head. 


The analysis made it possible to roughly identify the circumstances in which soccer 
goals were accidentally overturned. However, the size of the risk could not be 
determined quantitatively from these data alone, which made it difficult to assess 
quantitatively the importance of preventive measures and the specific form they 
should take. To determine means of prevention, we therefore measured the impact of 
an overturning soccer goal and the force required to overturn it. Because a soccer 
goal overturning accident had occurred when someone hung from the crossbar, we 
also measured the force on the soccer goal when an individual hung and swung 
from it. 

For two aluminum goals and one steel goal, we overturned each goal by pulling on 
ropes attached to the crossbar and measured the resulting impact with an impact force 
gauge holding a load cell sensor, mounted on the crossbar where it hit the ground. 

In each case, the ropes were pulled gently to avoid imparting a shock load, 
the pulling was stopped when the soccer goal began to tip over, and the goal was 
then left to turn over under its own weight. The pulling force was measured at the 
same time by a small load cell sensor attached between the ropes. 

As shown in Fig. 5.20, for the measured impacts when each goal overturned, the 
maximum value was 9,521 N for one aluminum goal, 18,980 N for the other, and 
29,283 N for the steel goal. The impact of the steel goal was thus found to be 1.5-3 
times those of the aluminum goals. Consideration of the relation between impact and 
injury indicates that the human skull will fracture under an impact of 3,000-5,000 N 
[3], and the results thus showed that impact by any one of these goals would be 
sufficient to pose a risk of skull fracture. 
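
The ratios and the comparison with the skull fracture range quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only the three maximum impact values and the 3,000-5,000 N range given in the text; the dictionary keys and function names are ours.

```python
# Maximum measured impacts from the text, in newtons
IMPACT_N = {
    "aluminum goal (1)": 9521,
    "aluminum goal (2)": 18980,
    "steel goal": 29283,
}

# Skull fracture range cited in the text [3], in newtons
SKULL_FRACTURE_RANGE_N = (3000, 5000)


def steel_to_aluminum_ratios(impacts):
    """Ratio of the steel goal impact to each aluminum goal impact."""
    steel = impacts["steel goal"]
    return {
        name: steel / value
        for name, value in impacts.items()
        if name != "steel goal"
    }


def exceeds_fracture_range(impacts, fracture_range=SKULL_FRACTURE_RANGE_N):
    """True for each goal whose impact exceeds the upper end of the range."""
    upper = fracture_range[1]
    return {name: value > upper for name, value in impacts.items()}


for name, ratio in steel_to_aluminum_ratios(IMPACT_N).items():
    print(f"steel goal / {name}: {ratio:.2f}")
print(exceeds_fracture_range(IMPACT_N))
```

Even the lightest of the three impacts is well above the upper end of the cited fracture range, which is the basis for the conclusion stated above.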

As noted above, we measured the force required to overturn a goal with a small load 
cell sensor mounted between the ropes used to pull on the goal. The measurement was 
performed for an aluminum goal alone and with weights of 20 to 80 kg (in 20 kg 
increments) attached to its lower rear bar, with the results shown in Fig. 5.21. With 
no weight attached, the goal was overturned by a force of only 242.2 N (24.7 kgf). 
The pulling force required to overturn the goal increased approximately linearly 
with the attached weight, with a slope of 0.94 when the pulling force was expressed 
in kilograms-force. This was approximately equal to the 0.89 ratio of the 223 cm 
length of the rearward-directed bar to the goal post height of 250 cm, indicating 
that the lower end of the goal post functioned as the fulcrum of a lever. 
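
The lever interpretation can be written as a moment balance about the lower end of the goal post: an extra pull force dF applied at crossbar height h must balance a weight W acting at the end of the rear bar of length L, so dF * h = W * g * L and dF / (W * g) = L / h. The sketch below evaluates this idealized model with the dimensions given in the text. The predicted forces are model outputs under that assumption, not the measured values plotted in Fig. 5.21.

```python
G = 9.80665  # standard gravity in m/s^2, used to convert kgf to N

POST_HEIGHT_M = 2.50      # goal post height from the text
REAR_BAR_LENGTH_M = 2.23  # rearward-directed bar length from the text
BASE_FORCE_N = 242.2      # measured overturn force with no weight attached


def lever_ratio():
    """Extra pull force per unit of attached weight (kgf per kg), L / h."""
    return REAR_BAR_LENGTH_M / POST_HEIGHT_M


def predicted_overturn_force_n(weight_kg, slope=None):
    """Idealized overturn force: measured baseline plus the lever term."""
    slope = lever_ratio() if slope is None else slope
    return BASE_FORCE_N + slope * weight_kg * G


print(f"lever ratio L/h = {lever_ratio():.3f}")
for weight in (0, 20, 40, 60, 80):
    print(f"{weight:>2} kg -> {predicted_overturn_force_n(weight):.1f} N (model)")
```

Passing `slope=0.94` instead of the geometric ratio gives the line implied by the measured slope, so the two can be compared directly.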




[Bar chart: aluminum soccer goal (1), aluminum soccer goal (2), and steel soccer 
goal; vertical axis from 0 to 35,000] 

Fig. 5.20 Impact of overturning soccer goal 

[Plot: pull force required to overturn [N] (vertical axis, up to 1,200) against 
weight [kg] (horizontal axis, 0 to 100)] 

Fig. 5.21 Force required to overturn soccer goal 


The most common circumstance of soccer goal accidents at schools is that of 
a child hanging from the goal crossbar and swinging forward and rearward. The 
horizontal load applied in such circumstances was reproduced and measured in an 
experiment using a purpose-built steel-post assembly in which a biaxial load sensor 
was attached to each end of the horizontal bar to measure the horizontal and 
vertical loads. The experiment was performed with 10 cooperating junior high school 
students, each of whom hung from the bar and swung. Figure 5.22 shows the 
maximum horizontal loads found in the trials. Overall, the maximum force applied 
in any forward or rearward swing was 405.4 N (41.4 kgf). 

Fig. 5.22 Horizontal load with forward and rearward swinging 


Taken together, the results indicated that the force of crossbar impacts near ground 
level when the goal overturned ranged from a minimum of 3,887 N to a maximum 
of 29,283 N and thus posed a high risk of causing skull fracture. An aluminum goal 
with no weight attached was overturned by a force of only 242.2 N (24.7 kgf), 
whereas a child hanging and swinging forward and rearward imparted a horizontal 
force of up to 405.4 N (41.4 kgf) on the crossbar. A soccer goal can therefore be 
readily overturned by the swinging action of just one student unless it is securely 
fastened down or its movement is restrained by a mounted weight. 
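
To make the comparison concrete, the sketch below sets the maximum swinging force against the overturn force under the approximately linear relation reported in the text (24.7 kgf with no weight, slope 0.94 kgf per kg of attached weight). It is a static comparison under that assumed linear fit only: it ignores wind, several children swinging at once, and any safety margin, and its break-even value should not be read as a recommended anchoring weight.

```python
SWING_FORCE_KGF = 41.4     # maximum horizontal force from one swinging student
BASE_OVERTURN_KGF = 24.7   # overturn force of the aluminum goal with no weight
SLOPE_KGF_PER_KG = 0.94    # reported slope of overturn force against weight


def overturn_force_kgf(weight_kg):
    """Overturn force under the assumed linear relation from the text."""
    return BASE_OVERTURN_KGF + SLOPE_KGF_PER_KG * weight_kg


def is_overturned_by_swing(weight_kg, swing_kgf=SWING_FORCE_KGF):
    """True if the swinging force reaches the overturn force at this weight."""
    return swing_kgf >= overturn_force_kgf(weight_kg)


def break_even_weight_kg(swing_kgf=SWING_FORCE_KGF):
    """Attached weight at which the overturn force equals the swinging force."""
    return (swing_kgf - BASE_OVERTURN_KGF) / SLOPE_KGF_PER_KG


for weight in (0, 20, 40):
    print(weight, "kg:", "overturns" if is_overturned_by_swing(weight) else "holds")
print(f"break-even weight: {break_even_weight_kg():.1f} kg (static model only)")
```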

These results have been presented at symposia, and the specific data have been 
used in awareness-raising activities. 


5.6.2 Engagement for Preventing Vaulting Box Accidents 


Analysis of 97,716 accidents relating to elementary school exercise activities recorded 
in the Injury and Accident Mutual Aid Benefit System data of the Japan Sport Coun- 
cil in AY2014 showed that vaulting box exercise accidents were most numerous [4]. 
They numbered 14,715 and thus accounted for approximately 15% of the total acci- 
dent number. Among injuries suffered in vaulting box accidents, bone fractures were 
most numerous and accounted for approximately 37% of all injuries. The phases 
in which vaulting box accidents occur include run-up, takeoff, the time from start 
to end of hand contact, landing, and forward somersault on the platform, with the 
largest number of accidents occurring during the time from start to end of hand 
contact. Many of the bone fractures likewise occurred during the time from start to 
end of hand contact. Further details are lacking, however, and the data currently 
available on accident circumstances and child movements are too coarse to apply 
directly to injury prevention. 

We therefore observed vaulting in actual classes and classified the relationship 
between vaulting patterns and the risks involved, in collaboration with Toshima 
Ward Fujimidai Elementary School and physical therapists. The patterns found 
included (1) low momentum in takeoff, incorrect arm support, and insufficient 
movement of the center of gravity, resulting either in contact of the buttocks with 
a hand on the vaulting box and a wrist sprain, or in failure to clear the vaulting 
box and impact of the buttocks on it; and (2) concentration on forward movement 
alone, leading to loss of balance and falling on landing. Based on this analysis, we 
have developed a system that shows vaults with a risk of accident, checkpoints for 
the vaulting action, and practice methods for correcting ineffective moves 
(Fig. 5.23), and we will proceed with its evaluation and modification through 
actual use at elementary schools. 

Fig. 5.23 Software supporting guidance on vaulting box safety 


5.7 Conclusion 


In this report, we have described trend analysis functions that are important for 
advancing school safety, as applied to data dispersed across multiple organizations 
using basic security technology, together with an analysis of actual judo accidents 
at schools. For problems brought to light by the system under development in this 
study, we have acquired and analyzed the detailed data necessary for injury 
prevention, and we have described our work on accidents involving soccer goal 
overturning and vaulting box activities. 

We will continue to apply the system under development at actual sites of 
activity while advancing its verification, and we will investigate ecosystems for 
carrying out injury prevention through on-site use of the system. 


Abstract Under the new law for the secondary use of medical information, which 
came into force in May 2018, the secondary use of anonymized information is 
expected to contribute to research and development in the medical field, including 
integrated medical research and public health. On the other hand, under the revised 
Personal Information Protection Act and the revised ethical guidelines for medical 
research, privacy protection and patient consent management are crucial issues in 
research management. Our JST CREST project, which started in March 2014, 
has addressed the development of technological elements and integrated the 
developed methods into a real-world system for the secondary use and privacy 
protection of big data on cloud infrastructure, including safe clinical information 
management, commercial cloud utilization, and privacy risk evaluation. In this 
paper, assuming the use of Standardized Structured Medical Record Information 
Exchange version 2 storage, the following target issues are described: (1) effective 
utilization of existing standardized storage, (2) secure data collection across 
medical institutions, (3) privacy risk evaluation in analysis, and (4) traceability 
during secondary use. 


6.1 Overview of Legislation and Standardization for the 
Secondary Use of Electronic Medical Records 


6.1.1 Personal Information Protection Act and 
Next-Generation Medical Infrastructure Act 


The Personal Information Protection Act [1] was revised in September 2015, and the 
revision came fully into force in May 2017. The original Act was established in 2003 
and came fully into force in 2005. When the original law was established, both 
houses of the Diet recognized that it was insufficient and that separate laws would 
be required in multiple fields, including medicine, but no such review actually took 
place. This situation persisted for a decade or more, so the revision was long 
overdue. The revision improved several problems in the previous law; however, a few 
problems still exist, and some new concerns have emerged. These include fears that 
secondary use, which is essential in the medical treatment field, will become 
problematic. 

To avoid a negative impact on innovation, including drug discovery and medical 
equipment development, the Next-Generation Medical Infrastructure Act (official 
name: Act on Anonymously Processed Medical Information to Contribute to Medical 
Research and Development) [2], which specializes in the secondary use of medical 
data, was enacted in April 2017 and came into force on May 11, 2018. In this paper, 
we examine the issues related to the Personal Information Protection Act and the 
predicted effects of its revision, and then give an overview of the Next-Generation 
Medical Infrastructure Act and consider its impact. 


6.1.1.1 The Personal Information Protection System 
in Japan and Related Issues 


The main objectives in reviewing the previous law were to respond to the 1995 EU 
directive concerning cross-border personal information and to address concerns 
about privacy infringement arising from the resident registration network brought 
about by revisions to the Basic Resident Registration Act. There were also concerns 
regarding the prospect of eavesdropping being made possible with court approval 
through revisions to the Criminal Procedure Code. As previously stated, the law 
came fully into force in April 2005. It basically conformed to the OECD guidelines 
on cross-border flows of personal information [3]; however, several issues have 
been pointed out. The problems in the medical area include the following: the 
law is a comprehensive one that does not specify fields; the fact that, in medicine, 
information is frequently provided to third parties as an essential part of care is 
not taken into account; the definition of personal information is ambiguous, which 
inevitably makes anonymization difficult; and the focus is placed on protection, so 
the promotion of reuse that does not violate the rights of the individual, which was 
the original purpose of the law, is largely ignored. In addition, with respect to the 
private sector, the law is aimed at the operator, so where an individual causes the 
infringement, it acts only as an indirect regulation through supervision by the 
operator. Moreover, the penalties are light and indirect and thus lack effectiveness. 
Different systems for acquiring personal information apply to the national 
government, independent administrative agencies, local governments, and private 
enterprises, and this is considered to obstruct the utilization of personal 
information across these frameworks. 

In the revised Personal Information Protection Act, the concept of important 
information has been introduced, and as the vast majority of medical data is 
designated as important information, this is a step forward from a comprehensive 
law that does not specify fields. Explicit consent is required to acquire important 
information, and third-party provision based on opt-out, in which such information 
can be provided to third parties while the intentions of the person concerned are 
still unclear, is prohibited. This is clearly a step forward and promises to suppress 
provision to third parties where this is not intended by the person concerned. On 
the other hand, third-party provision is essential in collaborative medicine, and 
there were concerns about the explanation of symptoms to family members, 
consultations with specialists, etc. These cases are largely covered by the clear 
treatment of opt-out consent as “implicit consent” in the guidance concerning the 
appropriate handling of personal information by healthcare and nursing care 
providers (hereafter, “healthcare and nursing care guidance”), implementation 
guidelines issued jointly by the Personal Information Protection Commission 
(hereafter, “PPC”) and the Ministry of Health, Labour and Welfare (hereafter, 
“MHLW”). However, under the law, explicit consent is still required for provision to 
third parties for the purpose of drug discovery or the development of medicine or 
medical equipment. Gaining explicit consent places a considerable burden on 
medical sites, and even where there is no intention to violate rights, this must be 
considered significantly more problematic than under the previous law. The revised 
law also introduces the concept of anonymized information: data anonymized in 
accordance with the standards of the PPC may be provided without consent under 
certain conditions. 

However, the recipient of the data must be subject to a prohibition on 
reidentification and to safety management requirements, and the duty to make 
third-party provision of anonymized information public makes the procedure 
complex. Additionally, meeting the anonymization standards of the PPC requires a 
certain amount of information processing capability and is not simple. Another 
feature of the revised law, which is not specific to important information, is that 
traceability must be secured. Although a significant impact was feared in the 
healthcare and nursing care fields, where provision to third parties is frequent, the 
healthcare and nursing care guidance treats virtually all such provision as essential 
for healthcare and nursing care, thus avoiding a major increase in the workload of 
healthcare and nursing care institutions. On the other hand, for provision to third 
parties for secondary use that is not essential for healthcare and nursing care, 
records must be created at the time of provision and confirmed at the time of 
receipt. Genetic information, too, is specified as a personal identifier code from 
which the individual can be identified, provided certain conditions are met, and 
this has immense medical significance. 

The points mentioned above are all features of the previous and revised laws as 
seen from the perspective of the healthcare and nursing care field. In addition, the 
revised law has become more effective because responsibility for enforcing it has 
been centralized in the PPC and penalties have been significantly increased. Other 
major changes have also been made, such as the clarification of the conditions for 
distributing personal information overseas, but these are only mentioned in this 
chapter. 

The revised law promises to improve several issues in the previous law. The 
strengthening of punitive measures increases its effectiveness, and the introduction 
of the concept of important information reduces discrimination based on the illegal 
use of special personal information by preventing its provision to a third party 
against the intentions of the person concerned. However, several issues remain 
unresolved. The first is that operations are governed by different regulations for 
the national government, independent administrative bodies, local governments, 
and private enterprises; with about 1,800 autonomous bodies, there are close to 
2,000 statutes. There are no major differences in their basic thinking, but the 
executing body varies depending on the statute, and subtle decisions are made by 
each executing body. In healthcare and nursing care, the provider is often a private 
entity, but local governments and institutions of equivalent rank often contribute, 
as do national institutions and independent administrative bodies. 

For example, if one prefectural, two city, and two town hospitals and five private 
medical institutions collaborate to share patient data in an integrated manner, at 
least four autonomous bodies will need to review whether this is possible. For 
healthcare providers looking to move forward, this can become a significant burden. 
The fact that the applicable statute differs depending on the acquiring body has not 
been improved at all. For hereditary information, it is expected that genetic 
information will be specified as a personal identification code and handled 
prudently; however, under the Personal Information Protection Act, consent acts as 
an absolute exemption. In the case of hereditary information, even if the person 
providing the information consents, the impact may extend to blood relatives such 
as parents and offspring. If, as a result of a parent’s consent, a child became the 
victim of discrimination, this could not be handled under the Personal Information 
Protection Act. At present, improving this point does not seem possible, and 
several people have pointed this out as an issue. It should be reviewed in the near 
future, and it is to be hoped that it will be resolved quickly. 


6.1.1.2 Review and Establishment of the Next-Generation Medical 
Infrastructure Act 


As previously described, the revisions to the Personal Information Protection Act 
improved several of the issues in the previous law. However, secondary use in which 
there is no intention to violate personal rights and the aim is to use personal 
information for the public good was previously possible via opt-out, and with the 
revisions, this is no longer possible. Healthcare and nursing care must be based on 
medicine, but medicine cannot develop without the use of patient and user data. 
Research results obtained in the laboratory or with animals cannot be applied 
immediately in medicine and healthcare; knowledge obtained from humans is 
essential. In other words, if this type of usage is suppressed, the acquisition of 
medical knowledge itself will likely be suppressed, and this may obstruct the 
development of healthcare and nursing care. If medical institutions and nursing 
care providers are able to anonymize data themselves, they can provide it for 
secondary use without consent. However, in regional comprehensive care and 
collaborative medicine, information is distributed among multiple operators. 
Therefore, unless the information can be concentrated in a single institution 
through a joint use declaration, linking and anonymizing the disparate information 
will not be possible. Anonymization makes reidentification impossible, so 
information that has already been anonymized cannot necessarily be linked. The 
simple solution would be to make a joint declaration of use; however, in this case, 
the boundary between the information for joint use and other information must be 
made clear. In Japan, patients and users can freely choose healthcare and nursing 
care services, so setting this boundary is inherently difficult. Additionally, it is 
necessary to announce the fact that anonymization is taking place, and anonymized 
information cannot be provided without restriction. A prohibition on 
reidentification is required of the recipient, and safety management is also 
required, although only as a best-effort obligation. The provider has no duty to 
supervise the recipient, but if an incident or illegal use occurs, the complaint from 
the individual to whom the information relates will be directed at the providing 
medical or nursing care institution and may result in a civil lawsuit. Supervision 
may therefore be regarded as effectively mandatory. Although this is not 
impossible, some preparedness and effort are required. It is not desirable for this 
situation to affect the development of medicine and medical equipment or drug 
discovery. Regarding academic research, Chap. 4 of the revised law states that the 
various duties are not imposed on operators acquiring personal information, but 
this is limited to academic research by academic research institutions. Although 
there are calls to draw up and apply guidelines for those not covered by Chap. 4 of 
the revised law, such guidelines are difficult to implement on a statutory basis. 

Faced with this situation and aware of the need to promote use for the public good 
that does not violate the rights of the individual, the Cabinet Secretariat and the 
Office of Healthcare Policy took the lead in reviewing measures, and the 
Next-Generation Medical Infrastructure Act was submitted to the Diet as Cabinet 
legislation. The basic idea was that operators able to perform reliable and safe 
anonymization and to provide information safely for the public good in a broad 
sense would be accredited, and that medical institutions could provide medical 
information to these accredited operators on an opt-out basis. The Act was passed 
at the end of April 2017 and promulgated in May. 


6.1.1.3 Content of the Next-Generation Medical Infrastructure Act 


This law centers on accredited anonymizer medical data creation operators: as 
previously described, operators who can perform anonymization reliably and handle 
and provide information safely are accredited by the government. The law intends, 
“through the safe and appropriate utilization of anonymized medical data, to promote 
cutting-edge R&D related to health and healthcare, and new industries, and contribute 
to the development of a society where people live healthy and long lives.” The aim is 
not simply commercial use but use for the public good in a broad sense. Although its 
scope is narrow, it is positioned as a special law relative to the Personal Information 
Protection Act and takes precedence over that Act within its scope. 


Definition of Wording 

This law is aimed not at personal information in general but at “medical data.” The 
main target is information related to healthcare, which is a type of important 
information, and the definition has been slightly expanded. The Personal Information 
Protection Act covers the information of living individuals, but the Next-Generation 
Medical Infrastructure Act includes medical information on deceased people as well. 
In healthcare, life and death form a continuum, so this can be considered a 
reasonable extension. Additionally, in the revised Personal Information Protection 
Act, the guidelines for anonymization are set by the PPC, whereas the guidelines 
for anonymizer medical data in the Next-Generation Medical Infrastructure Act are 
provided by the minister in charge. However, the wording of the definition is the 
same, and the law makes clear that the guidelines should be determined after 
consultation with the PPC. Anonymizer information and anonymizer medical 
information are basically the same, but for the latter, it is possible to provide 
detailed guidelines depending on the case in which it is used. 


Accredited Anonymizer Medical Data Creation Operators 

The core of this law is the stipulation of accredited anonymizer medical data creation 
operators. Accreditation is limited to companies that possess appropriate 
anonymization capabilities and can provide information to operators who can handle 
anonymizer information safely in accordance with the law. The anonymization work 
of such operators is not subject to the stipulations regarding the creation of 
anonymizer information in Article 36 of the Personal Information Protection Act. 
Additionally, safe management of information and appropriate measures to ensure it 
are required. The work must also not contravene the principle that the information 
is provided to contribute to R&D in the medical field, and use that exceeds the scope 
needed to achieve the objectives of the accredited operator is not permitted. 

Apart from this, no particular restriction is placed on the operator’s business other 
than the accredited work. Additionally, provided that the information before 
anonymization was obtained for the operator to create anonymizer medical data, it 
may be provided to other accredited anonymizer medical data creation operators 
within the scope of that purpose. In this way, if, for example, accredited operator A 
mainly accumulates hospital information and operator B mainly collects clinic 
information, A can provide information to B, and B to A, so that anonymizer medical 
data can be created after linking the medical information of the clinics and hospitals. 


Operators Handling Medical Data 

This refers to medical institutions. Broadly speaking, two types of regulation apply 
when they provide medical data to accredited anonymizer medical data creation 
operators. The first is notice to the patients and notification to the minister in 
charge. The notice must make clear that provision will be stopped if the patient or 
a bereaved family member so requests. A point to note here is that this is described 
as “notice” to the patient: simply displaying it is not enough, and the content of the 
notice must actually be communicated to the patient, etc. The second is that if 
provision is stopped at the request of the person concerned or the bereaved family, 
there is a duty to issue written evidence that a request to stop provision was made, 
and a copy of this must be stored. Where a request has been made to stop provision, 
the accredited anonymizer medical data creation operator may not receive the 
medical data concerned. 


Operators Handling Anonymizer Medical Data 

Recipients provided with anonymizer medical data from the accredited anonymizer 
medical data creation operators are exempt from the stipulations of Articles 37 (pro- 
vision of anonymizer information), 38 (prohibition on identification action), and 39 
(safety management measures, etc.) of the Personal Information Protection Act. On 
the other hand, in the Next-Generation Medical Infrastructure Act, reidentification 
itself is prohibited. This should not just be a penalty stipulation for “operators
handling anonymizer medical data,” and restrictions through agreements with accredited
anonymizer medical data creation operators are also necessary. If an actual breach 
occurs or the agreement conditions lack effectiveness, the application of the Unfair 
Competition Prevention Act should also be considered. 


Accredited Medical Data Handling Contractors 
Operators undertaking the work of accredited anonymizer medical data creation 
operators need to be accredited by the government. 


6.1.1.4 Opting-Out Under the Next-Generation Medical 
Infrastructure Act 


Provision to third parties is specified with opt-out under Article 23, paragraph 2, 
of the Personal Information Protection Act. When providing to a third party after 
notifying the party concerned or in a situation where the person concerned could 
easily learn of the fact, provided that there is no motion of refusal from the person 
concerned, it may be provided to a third party. Originally, this was prohibited in 
cases involving sensitive information, and so, medical data cannot be provided to a 
third party in this way. In contrast, in the Next-Generation Medical Infrastructure 
Act, as long as the third-party provision is to an accredited anonymizer healthcare 
information creation body, an exception shall be granted, and this may be provided
in an opt-out form. However, provision to a third party is permitted only after
actually notifying the person concerned and only if there is no motion of refusal.
In other words, a situation in which the person could easily learn of the fact is
not sufficient. 


6.1.1.5 Safety Management Measures 


The safety management measures section stipulates the safety management measures 
to be taken by accredited anonymizer healthcare information creation bodies, and 
the contents are as follows: 


1 Purpose and target of application 

2 Concrete measures 

2-1 Organizational safety management measures 

2-2 Human safety management measures 

2-3 Physical safety management measures 

2-4 Technical safety management measures 

2-5 Other measures 


These can be considered to be typical chapter headings, and the majority of these 
are not particularly different from the MHLW “Security Guidelines for Medical 
Information Systems.” However, within accredited operators, the network is limited
to dedicated lines and IP-VPNs. In terms of availability, the superiority of dedicated
lines is unquestionable; however, because they are not clearly superior in regard to
integrity or confidentiality, implementation may be difficult when cost is considered. 


6.1.1.6 The Future of the Next-Generation Medical 
Infrastructure Act and Issues 


If the Next-Generation Medical Infrastructure Act functions as intended, the reg- 
ulations strengthened in the revised Personal Information Protection Act can be 
introduced in a form without the risk of violating the rights of individuals. Moreover, 
regarding the purpose restricted to R&D in the medical field, there are expectations 
that it can be promoted in a safe and significant manner. However, two main issues 
are identified. The first is the establishment of the system itself. Although the law 
has been established, we are still waiting for the establishment of a basic policy as 
well as the government and ministerial ordinances delegating the main part of this 
work. At present, only the outline has been fixed. We do not yet have a system with 
“meat on the bones” to be used in actual operation, and efforts by all related parties 
are required. Additionally, it is expected that there will be public comments once a
draft of the government and ministerial ordinances or guidelines is determined, and
hopefully, many people will submit constructive comments. The second issue is that
although the work of accredited anonymizer medical data creation operators is a public
work contributing, in a meaningful way, to R&D in the medical field, the law presumes
the accreditation of private operators. In other words, the accredited operators must
both maintain their own survival and continue work with significant public work
elements. This is certainly not simple for a private operator. 
Unless the accredited operator can gain trust, the medical institutions, etc., will lose 
enthusiasm for the provision, and the system itself may fail. If we consider the world 
aimed for by this law to be significant, the support of not only the administration 
and operators aiming for accreditation but also a wide range of people, including 
medical-related parties and patients, is required. 

Even if the aforementioned issues can be safely overcome and operations begin, 
problems will remain. We have repeatedly indicated that this framework is based on 
accredited anonymizer medical data creation operators who are private companies. 
However, at present, the government, local governments, and insurers are systemati- 
cally accumulating information, and much of the useful medical data is owned by the 
government. For example, information on life and death is the ultimate outcome of 
treatment, and to determine this outcome with certainty, basic resident registration 
information and death certificates, etc., must be accessed. Although according to 
this law, cooperation on consent with accredited anonymizer medical data creation 
operators is possible, there is no consideration at all regarding collaboration on the 
information owned by the government, local governments, and insurers under this 
system. Despite the fact that R&D in the medical field is urgent in terms of maintain- 
ing social security, it must be indicated that there is a problem in terms of efficiency, 
based on the Next-Generation Medical Infrastructure Act alone. It is considered nec- 
essary to establish an external system to promote a comprehensive system for using 
information for the public good for government and private enterprise. Additionally, 
the security and anonymizer standards are somewhat abstract. For technologies such as
Privacy Preserving Data Mining and multiparty protocols, which use individual data only
in an anonymous form and only for calculation purposes, it has been demonstrated that a
technical solution is possible. However, because no consideration has been given from a
statutory or system viewpoint, it is difficult to judge whether they can be used under
the Personal Information Protection Act. While promoting technical initiatives, it is
also necessary to clarify their positioning in a statutory and system sense. 


6.1.2 Ethical Guidelines and Anonymization of Medical 
Information 


6.1.2.1 Ethical Guidelines 


The Ethical Guidelines for Medical and Health Research Involving Human Subjects 
[4] apply to medical research on human beings and basically require researchers to 
meet the requirements of the Act on the Protection of Personal Information. 
However, clinical research is exempt from Chap. 4 of the Personal Information 
Protection Act, which is structured as follows: 

Chapter IV Obligations, etc., of a Personal Information Handling Business Operator 

Section 1 Obligations of a Personal Information Handling Business Operator 

Section 2 Obligations of an Anonymously Processed Information Handling Business
Operator, etc. 

Section 3 Supervision 

Section 4 Private Sector Body’s Promotion for the Protection of Personal Information 


It also applies to the information infrastructure for collecting and analyzing medi- 
cal information that this project aims to build. In large-scale data collection research, 
specific responses required by the Ethics Guidelines are mainly described in the 
informed consent. The description in the Ethics Guidelines is as follows: 


Researchers do not necessarily need to receive informed consent; however, if informed
consent is not received, the appropriate consent of the research subject must be
obtained. However, in cases where it is difficult to obtain appropriate consent, and
there is a particular reason to use the information for the research, personal
information may be used provided that items (1) to (6) concerning the implementation
of the research are notified or published to the research subjects, and that research
subjects, etc., are guaranteed the opportunity to refuse that the research is carried
out or continued. 


Also, the Ethical Guidelines require that the following matters be notified or
published to the patient, etc.: 


(1) The purpose and method of use of samples and information (including methods when
provided to other organizations) 

(2) Items of samples and information used or provided 

(3) Scope of use 

(4) Name or title of the person responsible for the management of samples and
information 

(5) That the use of samples and information that identify the research subject, or
their provision to other research institutions, will be stopped at the request of the
research subject or its agent 

(6) How to accept the request in (5) from the research subject or its agent 


This is also true for information systems that deal with large-scale data.
Furthermore, if the target personal information has been anonymized, it is not
necessary to notify the patients. In this case, the patient is not guaranteed an
opportunity to withdraw consent. 

In fact, regardless of the presence or absence of anonymization processing for 
electronic medical record (EMR) items, in most cases, the content of research that 
uses medical information is only made public on the homepage of each medical or
research institution. It is therefore difficult for patients to understand how their
own EMR items are used and provided. 


6.1.2.2 Anonymizer Medical Data 


Anonymizer medical data is an extension of the anonymizer information under the 
Personal Information Protection Act. Under this Act, the target information is limited 
to the personal information of living individuals, while under the Next-Generation
Medical Infrastructure Act, the information of deceased individuals may also be
covered, depending on the situation. Additionally, anonymizer information is
information from which the individual cannot be identified by ordinary people, whereas
anonymizer medical data is information from which the individual cannot be recognized
by general healthcare-related people. In addition to fulfilling the stipulations of the
guidelines on anonymizer information determined under the Personal Information
Protection Act, processing based on additional risk analysis is therefore required.
Another characteristic is that even after the provision of the information, there is a
duty to follow up, including confirming how it is used. The content of these guidelines
is shown in Table 6.1. 


Table 6.1 Contents of guidelines for anonymizer information 


1 Positioning of these guidelines 
2 Definition 
  2-1 Medical information 
  2-2 Anonymizer medical data (related to Article 2, paragraph 3) 
  2-3 Anonymizer medical data creation business 
3 Duties of accredited anonymizer healthcare information creation bodies and bodies
  handling anonymizer healthcare information 
  3-1 Thinking behind duties regarding the handling of anonymizer medical data 
4 Processing required when creating anonymizer medical data 
  4-1 Processing standards for anonymizer medical data 
    4-1-1 Deletion of descriptions, etc., from which specific individuals may be identified 
    4-1-2 Deletion of individual identification codes 
    4-1-3 Deletion of codes that interconnect information 
    4-1-4 Deletion of peculiar descriptions, etc. 
    4-1-5 Other measures based on the nature of medical information databases 
  4-2 Items requested for investigation when creating anonymizer medical data 
    4-2-1 Format for using anonymizer data 
    4-2-2 Possibility of identification by referring to other information 
  4-3 Anonymizer medical data creation process 
  4-4 Method of anonymization based on medical data categories 
  4-5 Medical data-specific anonymization 
    4-5-1 Medical images 
    4-5-2 Genome data 
5 Safety management measures, such as anonymizer medical data 
6 Prohibition on identification actions 
7 Provision registration 

Another feature is that medical information is categorized from a risk perspective, 
which is shown in Tables 6.2 and 6.3. 


Table 6.2 Categorization of the risk of individual identification in medical data 


Category | Overview 
Identifier | Information directly linked to an individual (name, number of insured, etc.) 
Quasi-identifier | Information that, when multiple types are combined, can lead to the
identification of the individual (date of birth, organization, etc.). *The medical
institution code is considered a quasi-identifier 
Static attributes | Highly invariant information (height, blood type, allergies, dates
[such as consultation dates], etc.). Information related to external characteristics
such as disabilities. *Handling of information on chronic illnesses with a high level
of invariance needs to be reviewed 
Semi-static attributes | Data with universality for a fixed period (weight, etc.). It
is assumed that this relates to information on diseases, procedures, administered
medicines, etc. 
Dynamic attributes | Information that is constantly changing (data on inspection
values, food, other treatment, etc.) 


Table 6.3 Anonymizer examples through the categorization of medical data 


Category | Example of anonymization method 
Identifier | Deletion or irreversible pseudonymization 
Quasi-identifier | Generalization (date of birth -> year born, address -> prefecture)
or microaggregation that satisfies k-anonymity. Delete data items. Add attributes
(geographical, scale, etc.), such as medical institution codes, and convert codes into
an unidentifiable form 
Static attributes | Top and bottom coding of numerical values. Generalization or
microaggregation. Generalization or offset based on treatment date, etc. 
Semi-static attributes | Top and bottom coding of numerical values. Delete sensitive
diseases when not necessary 
Dynamic attributes | Anonymization not required, but where necessary, top and bottom
coding of numerical values. In consideration of the significance of abnormal values,
look at the distribution of values and carry out processing, such as rounding of upper
and lower % values 
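

The quasi-identifier row of Table 6.3 can be illustrated with a small sketch. The
following Python code is only an illustration under stated assumptions: the record
layout, the field names, and the rule that the first token of the address is the
prefecture are all hypothetical and are not taken from the guidelines. It generalizes
date of birth to year of birth and address to prefecture, and then checks whether every
combination of quasi-identifiers is shared by at least k records.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: date of birth -> year, address -> prefecture."""
    return {
        "birth_year": record["date_of_birth"][:4],   # "1975-04-12" -> "1975"
        "prefecture": record["address"].split()[0],  # assumption: first token is the prefecture
        "sex": record["sex"],
    }

def min_group_size(records, keys):
    """Smallest number of records that share one combination of the given keys."""
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    return min(groups.values())

def is_k_anonymous(records, keys, k):
    """True if every combination of quasi-identifiers occurs at least k times."""
    return min_group_size(records, keys) >= k

# Hypothetical dummy records, not real patient data.
patients = [
    {"date_of_birth": "1975-04-12", "address": "Tokyo Bunkyo-ku", "sex": "F"},
    {"date_of_birth": "1975-11-30", "address": "Tokyo Minato-ku", "sex": "F"},
    {"date_of_birth": "1982-01-05", "address": "Osaka Kita-ku", "sex": "M"},
    {"date_of_birth": "1982-07-19", "address": "Osaka Chuo-ku", "sex": "M"},
]
raw_keys = ["date_of_birth", "address", "sex"]
gen_keys = ["birth_year", "prefecture", "sex"]
generalized = [generalize(p) for p in patients]

print(is_k_anonymous(patients, raw_keys, 2))     # False: every raw record is unique
print(is_k_anonymous(generalized, gen_keys, 2))  # True: each group now has two records
```

In practice, the choice of k and of the generalization hierarchy would depend on the
risk analysis required by the guidelines, which this sketch does not model.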


6.1.3 Standardization of EMRs 


The Standardized Structured Medical Record Information Exchange (SS-MIX) [5, 
6] aims to promote and build on the results of the project for developing a
standardized electronic medical record information exchange system, commissioned by
the Health Policy Bureau of the MHLW in FY 2006 in Japan. 

Fig. 6.1 Overview of SS-MIX2 directory layout 


SS-MIX includes the following: 


(1) Hospital information system information gateway telegraphic message specification 
(2) “Standardized Storage Specification” directory structure 
(3) Electronic medical information CD and patient referral document CD specification 


Furthermore, this standardized storage is expected to be utilized in the following
scenarios: 


Ensuring continuity of medical information 
Repository in community healthcare coordination 
Information sharing among multiple vendors 
Utilization as backup information 


Figure 6.1 shows the file system layout of SS-MIX2 storage. Directories are sorted 
by patient ID, clinical date, and event type. 

Table 6.4 shows the clinical event types covered by SS-MIX2 storage represented 
by HL7 v2. We can represent 30+ clinical events using this storage [6]. 
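

As a rough illustration of a directory layout sorted by patient ID, clinical date, and
event type, the following Python sketch builds and parses such a path. The three-level
layout, the root directory name, and the sample identifiers are simplifying assumptions
made for this example; the actual SS-MIX2 specification may define additional directory
levels and file naming rules that are not reproduced here.

```python
from pathlib import PurePosixPath

def event_directory(root, patient_id, clinical_date, data_type):
    """Build a directory path ordered by patient ID, clinical date, and event type."""
    return PurePosixPath(root) / patient_id / clinical_date / data_type

def parse_event_directory(root, path):
    """Recover (patient_id, clinical_date, data_type) from a path built above."""
    patient_id, clinical_date, data_type = PurePosixPath(path).relative_to(root).parts
    return patient_id, clinical_date, data_type

# Hypothetical root and identifiers, used only for illustration.
path = event_directory("/ssmix2", "0000123456", "20180401", "OML-11")
print(path)                                    # /ssmix2/0000123456/20180401/OML-11
print(parse_event_directory("/ssmix2", path))  # ('0000123456', '20180401', 'OML-11')
```

Because the patient ID is the outermost level in such a layout, collecting one event
type for all patients means visiting every patient directory.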


6.2 Medical Test Bed Concepts and Requirements 


Considering the current situation surrounding medical information as described 
earlier and the future development of its utilization, we are developing a test bed
for a secure information utilization platform in the medical information field. The
test bed assumes that adopting the public cloud will promote the secondary use of
medical data scattered across medical institutions and organizations. 
Table 6.4 SS-MIX2 data types 


No | Data type | Name | HL7 message type 
1 | ADT-00 | Update of patient’s basic information | ADT^A08 
2 | ADT-00 | Deletion of patient’s basic information | ADT^A23 
3 | ADT-01 | Change of investigator | ADT^A54 
4 | ADT-01 | Cancellation of investigator | ADT^A55 
5 | ADT-12 | Reception of outpatient physical examination | ADT^A04 
6 | ADT-21 | Hospitalization plan | ADT^A14 
7 | ADT-21 | Cancellation of hospitalization plan | ADT^A27 
8 | ADT-22 | Conduct of hospitalization | ADT^A01 
9 | ADT-22 | Cancellation of conduct of hospitalization | ADT^A11 
10 | ADT-31 | Conduct of staying outside | ADT^A21 
11 | ADT-31 | Cancellation of conduct of staying outside | ADT^A52 
12 | ADT-32 | Conduct of return from staying outside | ADT^A22 
13 | ADT-32 | Cancellation of conduct of return from staying outside | ADT^A53 
14 | ADT-41 | Plan of change of department/building (change of room/bed) | ADT^A15 
15 | ADT-41 | Cancellation of plan of change of department/building (change of room/bed) | ADT^A26 
16 | ADT-42 | Conduct of change of department/building (change of room/bed) | ADT^A02 
17 | ADT-42 | Cancellation of conduct of change of department/building (change of room/bed) | ADT^A12 
18 | ADT-51 | Plan of discharge | ADT^A16 
19 | ADT-51 | Cancellation of plan of discharge | ADT^A25 
20 | ADT-52 | Conduct of discharge | ADT^A03 
21 | ADT-52 | Cancellation of conduct of discharge | ADT^A13 
22 | ADT-61 | Registration/update of allergy information | ADT^A60 
23 | PPR-01 | Registration/update of disease name (history) information | PPR^ZD1 
24 | OMD | Food order | OMD^O03 
25 | OMP-01 | Prescription order | RDE^O11 
26 | OMP-11 | Prescription conduct notice | RAS^O17 
27 | OMP-02 | Injection order | RDE^O11 
28 | OMP-12 | Injection conduct notice | RAS^O17 
29 | OML-01 | Specimen examination order | OML^O33 
30 | OML-11 | Specimen examination result notice | OUL^R22 
31 | OMG-01 | Radiological examination order | OMG^O19 
32 | OMG-11 | Notice of radiological examination conduct | OMI^Z23 
33 | OMG-02 | Endoscopy order | OMG^O19 
34 | OMG-12 | Notice of endoscopy conduct | OMI^Z23 
35 | OMG-03 | Physiological examination order | OMG^O19 
36 | OMG-13 | Notice of physiological examination result | ORU^R01 


Fig. 6.2 Overview of the system developed for secure data collection and analysis 


The main points in the development of a medical test bed are as follows: 


Unnecessary sensitive information not used for research is not leaked outside the
medical institution. 

Medical information can be securely extracted and combined across organizations. 

Information extraction can be controlled in line with patient consent. 

Privacy risk can be evaluated for the extracted dataset. 

Patients can verify the history of information utilization on the platform. 


Figure 6.2 presents an overview of the developed system [7]. The key concepts 
are the following: 


1. Each medical institution has EMR data in SS-MIX2 storage, including billions 
of HL7 v2 messages. 

2. HL7 v2 messages are periodically parsed and stored to relational database man- 
agement system (RDBMS) tables, maintaining synchronization with the billions 
of message files in SS-MIX2. 

3. Analysis requests from researchers and data collection are managed by the private 
set intersection (PSI) service on the cloud, which communicates with a client agent 
located at a client terminal and PSI agents located at each medical institution. 

4. Target data criteria, such as diseases, age, and gender, must be defined before the 
PSI executes data collection. The PSI party agent loads the target dataset in 
advance from the local RDBMS into memory. 

5. Data collection is achieved using PSI software, which is based on Bloom filter 
technology for record verification across institutions. The Bloom filter is applied
to realize data matching in which personal information does not leak outside each
hospital during the data collection process. 

6. The collected dataset can be verified considering the possibility of patient identi- 
fication using the extracted attributes. 

7. Patients can trace the use of their medical records during data collection. 

8. If they choose, patients can withdraw consent for the secondary use of their data. 
Consent withdrawal information is assumed to be input to the existing EMR system and
exported to the SS-MIX2 storage in each hospital. 
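

The role of the Bloom filter in step 5 can be sketched as follows. This is a plain
Bloom filter written for illustration only; it is not the PSI library used in the
system, and on its own it gives no cryptographic protection. The filter size, the
number of hash functions, the use of SHA-256, and the sample keys are all assumptions.
The point it shows is that one institution can publish a compact bit array instead of
its record keys, and another institution can test its own keys against that array.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests with no false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical record keys held by two institutions.
hospital_a = ["key-001", "key-002", "key-003"]
hospital_b = ["key-002", "key-003", "key-004"]

filter_a = BloomFilter()
for key in hospital_a:
    filter_a.add(key)

# Institution B learns which of its keys are probably also held by A.
# False positives are possible; false negatives are not.
candidates = [key for key in hospital_b if key in filter_a]
print(candidates)
```

A larger bit array or more hash functions lowers the false positive rate at the cost
of memory and computation.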


6.3 Features and Implementations of Secondary Use 
Infrastructure Development 


This section describes key features and implementation details of the test bed we
developed for the medical field. 


6.3.1 SS-MIX2 Standardized Storage 


6.3.1.1 Objective 


In Japan, SS-MIX2 is the common domestic standard for exporting whole EHR data as
HL7 v2 message files to external storage for purposes such as backup, regional
collaboration, and disease repositories. In this standard, EHR data is exported to
storage in a directory structure organized by patient ID, clinical date, and event
type. Because the files are grouped per patient, using the exported storage for
cross-patient analysis, such as epidemiological studies, is challenging. We are
applying an RDB-based virtual file system technology to the storage to achieve
cross-patient/cross-institution analysis without collecting data files. 


6.3.1.2 Methods 


The overview of the system is shown in Fig. 6.3. The storage is developed based on 
Filesystem in Userspace (FUSE), a virtual file system technology. We adopted pgfuse 
and PostgreSQL as the FUSE implementation and RDBMS, respectively. The recorded HL7
messages are stored in the DB tables as BLOB data, and the RDBMS traces the
transactions in real time. The HL7 messages are parsed by PL/SQL, and the parsed
medical records (HL7 segments, fields) are recorded in user-defined tables in the
RDBMS. Parsing tasks are intended to be executed periodically. Once the records have
been stored in the tables, the minimum required items can be queried through
individually applied view schemas according to the purpose of each analysis project.
Performance tests were executed with dummy messages of 109,174 files (922 MB in total,
1,689 patients) including 27 clinical events defined by the SS-MIX2 standard, such as
ADT-00, OMP-01, and OML-11. 


Fig. 6.3 Overview of the developed storage with a virtual file system 


Table 6.5 Evaluation results (s) for various numbers of medical records 


COPY (VFS to HDD) | COPY (HDD to VFS) | Delete (VFS) | Parse files (VFS) 
3,261 s | 1,466 s | 775 s | 1,989 s 
5.03 Mbps | 2.26 Mbps | 9.52 Mbps | 54.9 files/s 
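

The parsing step described in this section, in which each HL7 v2 message is split into
segments and fields before being written to tables, can be sketched in Python as
follows. The system itself performs this step inside the RDBMS; the Python version
below is a substitute chosen for readability, and the sample message is a made-up
dummy, not data from the test bed.

```python
def parse_hl7_v2(message):
    """Split an HL7 v2 message into (segment_id, fields) rows.

    Segments are separated by carriage returns and fields by "|".
    Components inside a field ("^") are left unsplit in this sketch.
    """
    rows = []
    for line in message.replace("\n", "\r").split("\r"):
        if not line:
            continue
        fields = line.split("|")
        rows.append((fields[0], fields[1:]))
    return rows

# Dummy message for illustration only.
sample = "\r".join([
    "MSH|^~\\&|HIS|HOSP|||20180401120000||OUL^R22|MSG0001|P|2.5",
    "PID|||0000123456||TEST^PATIENT||19750412|F",
    "OBX|1|NM|GLU^Glucose||98|mg/dL",
])

rows = parse_hl7_v2(sample)
for segment_id, fields in rows:
    print(segment_id, fields)

# In MSH, the first field is the separator itself, so MSH-9 (message type)
# is at index 7 of the list; in other segments, field n is at index n - 1.
message_type = rows[0][1][7]   # "OUL^R22"
patient_id = rows[1][1][2]     # PID-3: "0000123456"
```

Each row could then be inserted into a table keyed by message, segment, and field
position, which is one possible shape for the user-defined tables mentioned above.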


6.3.1.3 Results 


Performance test results are shown in Table 6.5. All types of messages could be 
parsed by PL/SQL. Based on the performance, this storage can process the daily 
generated medical records of our hospital in less than 2 h. 


6.3.1.4 Discussion 


The developed storage enables a rapid cycle for the secondary use of medical records
in analyses among institutions and also prevents the disclosure of unnecessary patient
information to each analysis by requiring that queries go through view schemas.
Moreover, using the developed storage, exported medical records and parsed result
tables can be backed up to a remote place in real time using DB replication
technology more easily than by synchronizing the enormous number of files. 
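

The idea of exposing only the minimum required items through a view schema can be
illustrated with a small example. The sketch below uses Python's built-in sqlite3
module in place of PostgreSQL so that it is self-contained; the table, the view, and
the column names are hypothetical and do not reflect the actual schema of the
developed storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lab_results (
    patient_id TEXT, patient_name TEXT, birth_date TEXT,
    test_code TEXT, value REAL
);
INSERT INTO lab_results VALUES
    ('0000123456', 'TEST PATIENT', '1975-04-12', 'GLU', 98.0),
    ('0000654321', 'SAMPLE PERSON', '1982-07-19', 'GLU', 110.0);
-- View for one hypothetical analysis project: no ID, no name, birth year only.
CREATE VIEW glucose_study AS
    SELECT substr(birth_date, 1, 4) AS birth_year, test_code, value
    FROM lab_results
    WHERE test_code = 'GLU';
""")

cursor = conn.execute("SELECT * FROM glucose_study ORDER BY value")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)  # ['birth_year', 'test_code', 'value']
print(rows)     # [('1975', 'GLU', 98.0), ('1982', 'GLU', 110.0)]
```

SQLite itself has no per-user permissions; in a server RDBMS, an analysis account
would typically be granted access to the view only, not to the underlying table.
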


6.3.1.5 Summary 


This section describes the development of a standardized storage for the purpose of 
cross-patient/cross-institution analysis based on the domestic EHR data exporting 
standard. We will try to develop a secure data collection infrastructure assuming the 
distributed environment of the developed storages. 


6.3.2 Secure Collection of Distributed Medical Information 


In this section,¹ we propose an alternative method of collecting and storing EMR data, 
wherein only necessary items are included in the collected data, eliminating the need 
for individually identifiable information to spread outside the medical institution.
The system keeps EMR data distributed within each medical institution while enabling 
cross-patient or cross-facility data collection and analysis. The PSI library developed 
by Miyaji [8] is used for the data integration and encryption of the extracted EMR 
data. This section aims to provide an overview of the system and its major technical 
elements and evaluate the transaction performance of data extraction and collection 
from the distributed SS-MIX2 storage. 


6.3.2.1 Methods 


Experimental Environment 

The transaction performance of data extraction and collection from the distributed 
SS-MIX2 storage was evaluated using an experimental environment comprising a 
server (PSI Server), three data stores (PSI Party), and a client (PSI Client). The 
Server and Party machines were deployed as VMware ESXi virtual machines. The 
PSI Client can be deployed on any machine that can run Java. 

Experimental data were virtually produced by anonymizing laboratory test result 
data in the SS-MIX2 storage exported from the EMR system of The University of 
Tokyo Hospital (Tokyo, Japan). Storages arranged so that the nodes had a 10% overlap
with one another were used for the evaluation tests. The hash value of the character 
string combining the patient’s name, date of birth, and sex was used as the key 
attribute of each record for the Bloom filter. 
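

A key of this kind could be derived as in the following sketch. Only the combination
of name, date of birth, and sex comes from the description above; the normalization
steps, the separator, and the choice of SHA-256 are assumptions added for this
example, since matching across institutions only works if every site builds the string
in exactly the same way.

```python
import hashlib
import unicodedata

def record_key(name, date_of_birth, sex):
    """Hash of the normalized string combining name, date of birth, and sex."""
    parts = (name, date_of_birth, sex)
    # Assumed normalization: Unicode NFKC, trimmed whitespace, upper case.
    normalized = [unicodedata.normalize("NFKC", p).strip().upper() for p in parts]
    joined = "|".join(normalized)  # assumed separator
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical dummy values.
key_a = record_key("Yamada Taro", "1975-04-12", "M")
key_b = record_key(" yamada taro ", "1975-04-12", "m")
key_c = record_key("Yamada Taro", "1975-04-13", "M")

print(key_a == key_b)  # True: same person after normalization
print(key_a == key_c)  # False: different date of birth
```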


¹This section is reprinted from “Studies in Health Technology and Informatics, Vol 255,
Katsuya Tanaka, Ryuichi Yamamoto, Kazuhisa Nakasho, Atsuko Miyaji, Development of a
Secure Cross-Institutional Data Collection System Based on Distributed Standardized
EMR Storage, pp. 35-39,” Copyright (2018), with permission from IOS Press. The
publication is available at IOS Press through
http://dx.doi.org/10.3233/978-1-61499-921-8-35. 


Fig. 6.4 Overview of the transaction flow during data collection 


Data Collection with PSI 

Figure 6.4 presents an overview of the transaction flow during a secure data collec- 
tion using the system. The entire system was designed as a Web service so that in 
the future the service could be available via a commercial cloud. The PSI application
programming interface was developed in Java using SOAP Web services and deployed on
Apache Tomcat. All Web communications were implemented with client authentication
under TLS 1.2. Extracted EMR data are encrypted using Cryptographic Message Syntax and
can be decrypted only by the user requesting the collection. 
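

For readers unfamiliar with client authentication under TLS, the following sketch
shows the corresponding server-side settings using Python's standard ssl module. The
system described here was implemented in Java on Apache Tomcat, so this is an
illustration of the concept rather than the actual configuration; the function name
and its file path parameters are hypothetical.

```python
import ssl

def make_server_context(certfile=None, keyfile=None, client_ca=None):
    """Server-side TLS context that requires a client certificate.

    certfile, keyfile, and client_ca are hypothetical file paths; when they
    are omitted, only the policy settings are applied.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse anything older than TLS 1.2
    ctx.verify_mode = ssl.CERT_REQUIRED           # the client must present a certificate
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)    # the server's own certificate and key
    if client_ca:
        ctx.load_verify_locations(cafile=client_ca)  # CA that issued client certificates
    return ctx

ctx = make_server_context()
print(ctx.minimum_version, ctx.verify_mode)
```

With such a context, a handshake fails unless the client presents a certificate
issued by the configured certificate authority.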


6.3.2.2 Results 


Table 6.6 summarizes the evaluation test results for data queries for calculations, 
Bloom filter calculations, and result data extraction for increasing numbers of EMRs. 
The processing time increased approximately linearly with the number of records. 


Table 6.6 Evaluation results (s) for various numbers of medical records 


Records | 20,000 | 40,000 | 80,000 | 160,000 | 320,000 | 640,000 | 960,000 | 1,280,000 
Query data | 0.1 | 0.2 | 0.2 | 0.3 | 0.6 | 1.0 | 1.5 | 2.0 
Bloom filter processing | 0.5 | 0.5 | 1.0 | 2.0 | 2.9 | 7.4 | 13.5 | 20.7 
Data extraction | 3.3 | 3.0 | 3.1 | 3.4 | 4.0 | 10.1 | 15.4 | 23.5 
Total | 3.8 | 3.7 | 4.3 | 5.7 | 7.4 | 18.5 | 30.4 | 46.2 
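

The scaling behavior in Table 6.6 can be examined with a short calculation. The
sketch below simply copies the record counts and total times from the table and
derives the throughput at each size, together with the marginal time per additional
record between the two largest datasets; it adds no new measurements.

```python
# Record counts and total times (s) copied from Table 6.6.
records = [20_000, 40_000, 80_000, 160_000, 320_000, 640_000, 960_000, 1_280_000]
total_s = [3.8, 3.7, 4.3, 5.7, 7.4, 18.5, 30.4, 46.2]

throughput = [n / t for n, t in zip(records, total_s)]
for n, t, r in zip(records, total_s, throughput):
    print(f"{n:>9,} records  {t:5.1f} s  {r:8.0f} records/s")

# Average extra time per additional record between the last two sizes.
marginal = (total_s[-1] - total_s[-2]) / (records[-1] - records[-2])
print(f"marginal cost: {marginal * 1e6:.1f} microseconds per record")
```

For the smallest datasets the total is dominated by a roughly constant cost of a few
seconds, so the growth with record count becomes visible mainly from 320,000 records
upward.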


6.3.2.3 Discussion 


Significance of the System 

The system was implemented entirely as a Web service architecture with encryption of
the extracted EMR data. This indicates that medical institutions participating in 
research would not need to maintain a secure connection to a specific service 
provider if the developed PSI services are operated on a commercial cloud. The 
encryption of EMR data avoids any disclosure of the extracted information to the 
cloud service providers. Furthermore, because the infrastructure makes it unnecessary
to connect an EMR storage to the Internet, the data storage is not directly exposed
to network attacks. To meet the requirements of a given analysis, the PSI can execute
not only intersection operations but also union operations on distributed datasets. 


Performance 

The experimental results showed that an intersection operation involving approxi- 
mately 1 million records was completed within a minute. With this level of processing 
performance, there should not be any problems with actual operations. We now intend 
to verify this with larger datasets. 


Future Work 

The remaining issues for development include (1) the management of consent infor- 
mation, (2) risk assessment for the extracted dataset, and (3) traceability management 
against data collection. The first issue can be addressed by scanning paper-based con- 
sent information related to patients opting-out of the secondary use of their data and 
storing the scanned data files to the SS-MIX2 storage. We intend to represent consent 
information as XML files, such as HL7 CDA Privacy Consent Directives, Release 1 
[9]. The other two issues are under discussion. 


6.3.2.4 Summary 


This section describes the underlying concepts and implementation of a secure data 
collection infrastructure with distributed standardized EMR storage. Using the PSI 
data collection technology, the experimental results demonstrated high performance. 
A few issues remain for future implementation. 


6.3.3 Privacy Risk Assessment of Extracted Datasets 


6.3.3.1 Overview 


This section describes a prototype of a Web service that enables a series of operations 
to perform privacy risk evaluation against a dataset extracted from multiple storages 
by the PSI service developed. 


6.3.3.2 Method 


The PSI and privacy impact assessment (PIA) libraries are applied to the SS-MIX2 
standardized storage developed so far, which adopts FUSE, a virtual file system
technology. As the FUSE implementation, pgfuse, which works with PostgreSQL, was
adopted. Assuming that a service for collecting data safely will ultimately be
operated in the public cloud, the server and client, which will be the nodes of the
data collection infrastructure based on the PSI and PIA libraries, are configured as
Web services using SOAP. The configuration of the experimental system is shown in
Fig. 6.5. 

The data for verification was constructed by anonymously processing the HL7 v2 format
data in the SS-MIX2 standardized storage held by The University of Tokyo Hospital and
virtually distributing it to three storages, forming a virtual multi-facility
environment. We stored 1 million specimen test result messages in each storage,
created a dataset using the PSI library, and developed a user interface that can apply
the extracted dataset to the risk assessment function in a one-stop manner. In
addition, the patients were artificially adjusted so that 10% of the patient count was
duplicated between the SS-MIX2 standardized storages. 


[Figure 6.5: three SS-MIX2 (PGFUSE) storages, PSI, and privacy risk evaluation of the collected data set] 


Fig. 6.5 Overview of experimental settings for privacy risk assessment 

Fig. 6.6 Operation flow of the developed privacy risk assessment service 

Fig. 6.7 Overview of the developed GUI for privacy risk assessment 


The privacy risk evaluation function is configured as a separate Web service and is 
positioned so that it can be applied to the data extracted by PSI as a post-acquisition 
processing step. The system operation flow of the risk evaluation function is shown 
in Fig. 6.6. Top and bottom coding and generalization processing (rounding numerical 
data to a specified division unit) were implemented so that they can be applied while 
checking, on the screen, the maximum and minimum values and the number of data in 
each data item of the extracted dataset. 

Figure 6.7 shows the user interface for evaluating privacy risk developed in our 
project. The dataset extracted by the PSI service can be loaded, and the maximum and 
minimum values of each data item can be checked. For numerical data, processing 
can be performed by specifying the upper and lower limits and the division unit. In 
addition, the interface can calculate the degree of overlap in the processed dataset 
for the group of attributes specified on the screen and display it as an index for 
privacy risk evaluation. 
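
The processing described above can be illustrated with a minimal Python sketch. It is 
not the service's actual code: the attribute names, limits, and division units are 
hypothetical, and the "degree of overlap" is interpreted here, as an assumption, as 
the number of records that share the same combination of attribute values. 

```python
from collections import Counter


def top_bottom_code(value, lower, upper):
    """Clamp a numeric value into [lower, upper] (top and bottom coding)."""
    return max(lower, min(upper, value))


def generalize(value, unit):
    """Round a value down to the start of its bin of width `unit` (bins start at 0)."""
    return (value // unit) * unit


def process_records(records, rules):
    """Apply top/bottom coding, then generalization, to selected attributes.

    `rules` maps an attribute name to a (lower, upper, unit) tuple.
    """
    processed = []
    for rec in records:
        out = dict(rec)
        for attr, (lower, upper, unit) in rules.items():
            out[attr] = generalize(top_bottom_code(rec[attr], lower, upper), unit)
        processed.append(out)
    return processed


def group_sizes(records, attrs):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(rec[a] for a in attrs) for rec in records)


def smallest_group(records, attrs):
    """Size of the smallest group; 1 means at least one record is unique."""
    return min(group_sizes(records, attrs).values())


# Hypothetical extracted dataset (attribute names are illustrative only).
records = [
    {"age": 23, "hba1c": 5.4},
    {"age": 27, "hba1c": 5.9},
    {"age": 64, "hba1c": 7.2},
    {"age": 68, "hba1c": 7.8},
    {"age": 97, "hba1c": 12.5},
    {"age": 91, "hba1c": 13.1},
]
# Hypothetical rules: (lower limit, upper limit, division unit) per attribute.
rules = {"age": (20, 90, 10), "hba1c": (4.0, 10.0, 2.0)}
processed = process_records(records, rules)

attrs = ["age", "hba1c"]
print("smallest group before:", smallest_group(records, attrs))
print("smallest group after: ", smallest_group(processed, attrs))
```

In this sketch every raw record is unique, whereas after processing each combination 
of values is shared by two records, which is the kind of change an operator would 
look for on the screen before releasing a dataset. 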


6.3.3.3 Results and Discussion 


The privacy risk evaluation service responds within several tens of seconds at most 
when numerous attribute values are specified for a post-extraction dataset on the 
scale of 100,000 records. Although performance is not a problem, the functions for 
processing numerical data and evaluating redundancy are all placed on the service 
side, so the service does not work unless the original data are exposed to it. This 
remains a problem from the viewpoint of data concealment, and the functional layout 
needs to be reconsidered. 


6.3.4 Secondary Use and Traceability 


This section describes how to implement the capability of traceability in the 
developed system for secure data collection and analysis. 


6.3.4.1 Objective 


Blockchain technology has recently been applied in healthcare fields, including 
primary patient care, data aggregation for research purposes, and connecting 
healthcare providers [10-12]. The system that we are developing has a second purpose, 
securing the traceability of EMR data, which requires methods for disclosing the logs 
of secondary use. In the present situation, where patients do not have any common ID, 
it is difficult for a patient to audit all the secondary use logs across the 
distributed storages of the hospitals that he/she has visited. With blockchain 
technology, we expect to provide patients with a common search infrastructure with 
immutable secondary use logs. Thus, we plan to apply blockchain technology to the 
aggregation of data extraction log records. This method has several possible 
implementations, and they must be evaluated assuming operation in real use. The 
following experimental results mainly concern the data structure and transaction 
performance, compared with a traditional implementation, for aggregating the 
distributed log records of EMR data extraction. 


This section is reprinted from “Studies in Health Technology and Informatics, Vol 264, Katsuya 
Tanaka, Ryuichi Yamamoto, Assessment of Traceability Implementation of a Cross-Institutional 
Secure Data Collection System Based on Distributed Standardized EMR Storage, pp. 1373-1377,” 
Copyright (2019), with permission from IOS Press. The publication is available at IOS Press through 
http://dx.doi.org/10.3233/shti190452. 


6.3.4.2 Methods 


Traceability for Patients 
EMR storage for the developed secure data collection system is designed to process 
queries from clinical researchers using the standard interface implemented by the 
PostgreSQL database. EMR data are extracted by data extraction requests handled 
by the PSI service. Thus, the selected records are identifiable from each query 
result, and they represent the disclosure history of EMR data during data collection 
through the developed PSI service. We expect that making the extraction log records 
searchable by patients will achieve traceability in the secure data collection 
system. However, because storage is distributed across hospitals, log records must 
be aggregated by some secure method to be made auditable. 

Log data is assumed to be represented by a combination of the following attributes: 


1. Identifier of target patient (patient identifier) 
2. Storage source (medical institution identifier) 
3. Disclosed destination (extracting user identifier) 
4. Purpose of use 
5. Type of extracted EMR data 
6. Extraction timestamp 


Attribute 1 (patient identifier) is mandatory for patient identification. In Japan, 
universal patient identifiers are not available at present. We assume that insurance 
numbers are preferable for searching log records across medical institutions because 
the patient ID issued by one medical institution can be used to search log records 
only at that institution. 

Attribute 2 (medical institution identifier) is used to distinguish the institution 
storing the extracted EMR data. 

Attribute 5 (type of extracted EMR data) is represented by HL7 v2 message types 
such as “ADT-00,” “OMP-01,” and “OML-11.” 

Attributes 3, 4, 5, and 6 allow patients to distinguish each secondary use of their 
EMR data. By verifying these attributes, patients can determine whether actual 
secondary uses are consistent with their consent. 


Data Structure for Query 

A query for EMR storage may extract the records of several patients at one time. 
For disclosing extracted history to patients, the extracted history should be sorted by 
patient, and each history should include the aforementioned attributes. 

Table 6.7 Sample data representing extraction history 


{ 
  "patientID": "781e5e245d69b566979b86e28d23f2c7", 
  "institutionID": "aabd258c8894b996e8d8561fa868364d", 
  "disclosedDestination": "AnalysisUser001", 
  "purposeofUse": "DrugDevelopment", 
  "typeofRecords": "OMP-01", 
  "extractionTime": "2018/11/12 01:23:45" 
} 


For any one patient, the extraction history grows each time a query hits that 
patient's EMR records. Moreover, this extraction history is distributed across the 
EMR storage sites of the participating medical institutions. 

To achieve an acceptable response, the extraction history must be aggregated within 
a realistic time. This is closely related to the data structure and size of 
each log record. Future studies should focus on the data size of stored log records. 

In the performance test, a simple message structure was defined in JSON (shown 
in Table 6.7). The patient and institution identifiers are represented as hash 
values. Each log record can be stored separately in the blockchain (separate style) 
or aggregated per patient by appending records to that patient's block (appending 
style). In the former method, the records related to the patient of interest must be 
gathered at query time. In the latter method, the block size grows as the system is 
used. We examined how performance differs when the data size of a record to be 
written is changed. 
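
The two storage styles can be contrasted with a small Python sketch. It uses plain 
in-memory dictionaries in place of a blockchain, so it shows only the data layout, 
not the behavior of Hyperledger Fabric or BigchainDB. The field names follow Table 
6.7; the choice of SHA-256 for hashing identifiers, the transaction ID format, and the 
example identifiers are assumptions made for illustration, since the text does not 
specify them. 

```python
import hashlib
import json


def hash_id(raw_id: str) -> str:
    """Hash an identifier so the raw value is not stored (SHA-256 is an assumption)."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


def make_log_record(patient, institution, user, purpose, record_type, timestamp):
    """Build one extraction log record with the six attributes of Table 6.7."""
    return {
        "patientID": hash_id(patient),
        "institutionID": hash_id(institution),
        "disclosedDestination": user,
        "purposeofUse": purpose,
        "typeofRecords": record_type,
        "extractionTime": timestamp,
    }


class SeparateStore:
    """Separate style: one entry per log record, keyed by a transaction ID."""

    def __init__(self):
        self.entries = {}

    def write(self, record):
        tx_id = f"tx{len(self.entries):06d}"  # hypothetical ID format
        self.entries[tx_id] = json.dumps(record)
        return tx_id

    def history(self, patient_hash):
        # Every entry is scanned and the matching pieces are gathered.
        records = (json.loads(v) for v in self.entries.values())
        return [r for r in records if r["patientID"] == patient_hash]


class AppendingStore:
    """Appending style: one growing entry per patient."""

    def __init__(self):
        self.entries = {}

    def write(self, record):
        key = record["patientID"]
        history = json.loads(self.entries.get(key, "[]"))
        history.append(record)
        self.entries[key] = json.dumps(history)  # entry grows on every write
        return key

    def history(self, patient_hash):
        return json.loads(self.entries.get(patient_hash, "[]"))


separate, appending = SeparateStore(), AppendingStore()
for ts in ("2018/11/12 01:23:45", "2018/11/13 08:00:00"):
    rec = make_log_record("patient-0001", "hospital-A", "AnalysisUser001",
                          "DrugDevelopment", "OMP-01", ts)
    separate.write(rec)
    appending.write(rec)
other = make_log_record("patient-0002", "hospital-A", "AnalysisUser001",
                        "DrugDevelopment", "OML-11", "2018/11/13 09:00:00")
separate.write(other)
appending.write(other)

pid = hash_id("patient-0001")
print(len(separate.entries), "separate entries;", len(appending.entries), "appended entries")
print(len(separate.history(pid)), "records found for the patient in either style")
```

Both stores return the same history for a patient; the difference is whether the cost 
is paid at query time (gathering scattered entries) or at write time (rewriting an 
entry that keeps growing). 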


Experimental Setups 
We evaluated the following three approaches to implementing the traceability 
function. Two of them are based on blockchain technology. The third applies the same 
PSI-based secure data collection method to log records stored in distributed 
PostgreSQL databases. 


• Hyperledger Fabric [13] 
• BigchainDB [14] 
• PSI (Bloom filter) 


The experimental settings for each approach are described below. Hyperledger Fabric 
and BigchainDB differ in the key/value store implementation used for search. 

Fig. 6.8 Experimental settings (Hyperledger Fabric) 


1. Hyperledger Fabric 
Figure 6.8 shows the experimental setup using Hyperledger Fabric to store query 
log records during data collection. Assuming two participating institutions, two 
nodes were set up for the performance test. The native implementation offers only 
key-value storage and therefore supports only the separate style. We also evaluated 
a Hyperledger implementation with CouchDB [15], which enables queries against the 
values in the JSON message described earlier. With CouchDB, both the separate and 
appending styles can be implemented. 

2. BigchainDB 
Figure 6.9 shows the experimental settings for using BigchainDB to store log data. 
As in the Hyperledger Fabric setup, two nodes were prepared for evaluation. MongoDB 
[16] was selected as the backend database. In this case, both the separate and 
appending (aggregated) styles are possible on the same implementation. 
The candidate query keys are the transaction ID of the stored block and the stored 
JSON value. 

3. PSI (Bloom filter) 
Figure 6.10 shows the experimental settings for the PSI implementation. 
Log records of data extraction are recorded at the time of extraction. Using the 
same method as for EMR data collection, we can gather the log records from the 
distributed storages under encryption. In particular, although patients search by 
specifying their insurance number, date of birth, and gender, the matching is 
performed using a Bloom filter, so these values are not directly disclosed on the 
infrastructure. 
In this test, three nodes were prepared for evaluation, but the performance 
measurement was executed on only one node. 
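
The idea of matching through a Bloom filter rather than on raw identifiers can be 
sketched in Python as follows. This is a plain Bloom filter only; it omits the 
encryption that the PSI service applies, and the filter size, the number of hash 
functions, the key format, and the sample values are all assumptions made for 
illustration. 

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def patient_key(insurance_number: str, birth_date: str, gender: str) -> str:
    """Combine the three search attributes into one string (format is hypothetical)."""
    return "|".join((insurance_number, birth_date, gender))


# Patient side: only the filter, not the raw values, is sent out.
query = BloomFilter()
query.add(patient_key("12345678", "1970-01-01", "F"))

# Storage side: each log record is tested against the received filter.
log_records = [
    {"key": patient_key("12345678", "1970-01-01", "F"), "typeofRecords": "OMP-01"},
    {"key": patient_key("87654321", "1985-05-05", "M"), "typeofRecords": "OML-11"},
]
matches = [r for r in log_records if query.might_contain(r["key"])]
print([r["typeofRecords"] for r in matches])
```

Because a Bloom filter can report false positives but never false negatives, a real 
deployment would size the filter for the expected number of entries and confirm 
candidate matches in a later step. 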

Fig. 6.9 Experimental settings (BigchainDB) 


6.3.4.3 Results 


Performance by Data Size 

Figure 6.11 shows the performance of writing records to the blockchain storage by 
record size for Hyperledger. In the experimental environment, writes worked normally 
for records of 7 MB or smaller. As the record size grew, the response became 
unstable. 

Figure 6.12 shows the same test for BigchainDB. The maximum record size was 
0.6 MB, which was much lower than that for Hyperledger. In addition, the transaction 
time to commit was longer than that for Hyperledger. 

By contrast, the record size for PSI can be as large as the database system allows. 


Transaction Performance 

Figure 6.13 shows the performance results of writing records to the blockchain 
storage for Hyperledger with and without CouchDB and for BigchainDB, using one or 
five threads. In all cases, multithreading improved write performance, but the 
throughput did not increase linearly with the number of threads. 

Among the three implementations, BigchainDB was slightly faster than 
Hyperledger. Hyperledger with CouchDB had the worst performance; this is likely 
caused by the cost of indexing within CouchDB. In the best case, 1 million records 
were written to the blockchain storage in 3–4 h. This performance is equivalent to 
writing 10 million records or less in one day. 

In contrast to the implementations using blockchain technology, the write performance 
of PSI was equivalent to the “insert” performance of the PostgreSQL database used. 
Inserting 1 million records into the database took less than 10 min. 
This performance is about 1,000 times faster than that of the blockchain 
implementations. 
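
The throughput figures quoted above can be checked with simple arithmetic. The 
following Python sketch uses only the numbers stated in the text (1 million records 
in 3 to 4 h for the blockchain best case, and under 10 min for PostgreSQL); the 
variable names are illustrative. 

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
RECORDS = 1_000_000


def records_per_second(records: int, seconds: float) -> float:
    """Average write throughput for a batch of records."""
    return records / seconds


# Blockchain best case: 1 million records in 3 to 4 hours.
fast = records_per_second(RECORDS, 3 * SECONDS_PER_HOUR)
slow = records_per_second(RECORDS, 4 * SECONDS_PER_HOUR)

# Extrapolated to a full day of continuous writing.
per_day_fast = fast * SECONDS_PER_DAY
per_day_slow = slow * SECONDS_PER_DAY

# PostgreSQL: "below 10 min" is an upper bound on the time, so this value
# is only a lower bound on the insert throughput.
postgres_min = records_per_second(RECORDS, 10 * 60)

print(f"blockchain: {slow:.1f} to {fast:.1f} records/s")
print(f"blockchain: {per_day_slow:,.0f} to {per_day_fast:,.0f} records/day")
print(f"PostgreSQL: more than {postgres_min:,.0f} records/s")
```

The blockchain best case works out to roughly 69 to 93 records per second, or 6 to 8 
million records per day, which is consistent with the stated ceiling of 10 million 
records per day. 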


[Figure 6.11 axis label: data size (MB)] 


Fig. 6.11 Performance results by record size (Hyperledger) 

Fig. 6.14 Performance result of querying records 


Query Performance 

Figure 6.14 shows the performance test results of retrieving one record from the 
blockchain storage using four types of implementation. No significant differences 
were noted in the query response times between Hyperledger and BigchainDB. 
“Hyperledger Key” and “BigchainDB transid” represent the separate style of storage, 
whereas “Hyperledger Value” and “BigchainDB AssetsText” represent the aggregated 
style. 

The query response is fast enough for actual use with 1 million records in the 
storage. This result is for a query that hits one record; the response time increases 
linearly with the number of records hit. 

On the other hand, the PSI implementation needs 1 min or less to aggregate the 
extracted results across the distributed databases. 


6.3.4.4 Discussion 


Based on the initial evaluations, the following recommendations are made. 


Transaction Performance 

The transaction performance of a blockchain network was quite low for storing the 
massive numbers of log records generated by queries in the developed system. With 
blockchain, a node can register at most about 100 transactions per second to 
storage. Compared with the implementation with PostgreSQL, the total number of 
transactions per day will be 1,000 times smaller. Without any aggregation of 
log records, it will be impossible to process the enormous numbers of log records 
generated for each EMR item. Some patient-based aggregation of log records should 
be considered to overcome this performance limitation. 


Data Size 

The results by data size show the upper limit for storing log records in the 
blockchain storage. Because writing large records to storage makes the system 
unstable, the appending style, in which records keep growing as the system operates 
over a long period, is not suitable. Considering the transaction performance test 
results mentioned earlier, the total number of transactions to the blockchain network 
per day should also be limited. 


Query Response 

As the amount of stored data increases, the search function must query all storage in 
the network. The complete set of log records therefore requires some form of index 
for searching by patient. The query performance test results show a good response 
when searching for a log record in the blockchain network despite the increase in the 
number of log records. 


Proposed System for Future Implementation 

Based on the performance evaluation results, we decided to implement the following 
policy as the basis for making the search log history visible to patients when using 
the developed secure data collection system: 


• Aggregate log data by patient in each facility. 

• Store all log records at each facility. 

• Record only the minimum amount of data in the blockchain, such as the log record 
identifier key and the facility identifiable key, for retrieving index data. 

• To query log data, use personal identification information, such as insurance 
number, date of birth, and gender. 


By following these policies, a patient can search the blockchain and find the 
storage facility. Moreover, the number of records that must be recorded per period 
can be reduced to the number of related patients. Figure 6.15 shows an overview of 
the proposed log search system. The log records should include the following: 


• Facility identifiable key 

• Log record identifiable key 

• Digest to audit each log record 

• Key to identify each patient (this could be generated by encrypting patient 
identifiers such as insurance number, date of birth, and gender) 

We plan to develop a log search system with the described structure. 
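
As a rough illustration of the proposed structure, the following Python sketch builds 
one index entry holding the four items listed above and checks a facility's log 
record against its digest. It is not the planned implementation: the field names, the 
secret, and the sample values are hypothetical, and a keyed hash (HMAC-SHA-256) is 
used here as a stand-in for "encrypting" the patient identifiers. 

```python
import hashlib
import hmac
import json


def patient_key(secret: bytes, insurance_number: str, birth_date: str, gender: str) -> str:
    """Keyed hash of the patient identifiers (an assumed stand-in for encryption)."""
    message = "|".join((insurance_number, birth_date, gender)).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def record_digest(log_record: dict) -> str:
    """SHA-256 digest of a log record, computed over a canonical JSON form."""
    canonical = json.dumps(log_record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_index_entry(facility_key: str, log_record_key: str,
                     log_record: dict, pkey: str) -> dict:
    """Minimal entry to write to the blockchain; the full record stays at the facility."""
    return {
        "facilityKey": facility_key,
        "logRecordKey": log_record_key,
        "digest": record_digest(log_record),
        "patientKey": pkey,
    }


def audit(index_entry: dict, log_record: dict) -> bool:
    """Check that the record held by the facility still matches the stored digest."""
    return hmac.compare_digest(index_entry["digest"], record_digest(log_record))


# Hypothetical values for illustration only.
secret = b"example-secret"
log_record = {
    "disclosedDestination": "AnalysisUser001",
    "purposeofUse": "DrugDevelopment",
    "typeofRecords": "OMP-01",
    "extractionTime": "2018/11/12 01:23:45",
}
pkey = patient_key(secret, "12345678", "1970-01-01", "F")
entry = make_index_entry("facility-A", "log-000001", log_record, pkey)

tampered = dict(log_record, purposeofUse="Marketing")
print("original record passes audit:", audit(entry, log_record))
print("tampered record passes audit:", audit(entry, tampered))
```

A patient who can recompute the same patient key could search the blockchain for 
matching entries, learn which facility holds the log records, and use the digest to 
confirm that the records returned by that facility have not been altered. 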


6.3.4.5 Limitations 


Because we did not have sufficient time to set up larger datasets, performance tests 
were executed for 1 million records or less. As the number of records increases, the 
test results and system stability may change. Performance tests with more records 
are required in future work. Similarly, performance should be evaluated for larger 
numbers of nodes. 

Fig. 6.15 Overview of the proposed log search service using a blockchain network 


6.3.4.6 Summary 


This section reports the initial performance results related to traceability for a 
secure data collection system under development. The desired data structure and 
system infrastructure were examined. Although a blockchain implementation is a strong 
candidate for establishing an audit infrastructure to verify the use of EMR data for 
clinical research, there are some challenges in maintaining long-term operation 
as the amount of data increases. Thus, we proposed a data structure and querying 
implementation to overcome these performance limitations. 


6.4 Integration and Prospects 


As described earlier, we have implemented and verified the following element 
functions for the secure secondary use of medical data, with access control based on 
consent information and confirmation of secondary use status through the traceability 
function. The key features of our medical test bed are the following: 


1. Improvement of the searchability of medical data in SS-MIX2 standardized storage 

2. Safe medical data extraction function from SS-MIX2 standardized storage using PSI 

3. Electronic description of consent information and mechanism for checking consent 
information when extracting data 

4. Privacy risk assessment function for the extracted dataset 

5. Traceability function that can be verified by patients. 

Currently, the aforementioned functions are being integrated and developed with the 
aim of providing them as a Web service deployable on a public cloud. 

If our developed system becomes available on a public cloud, it will help clinical 
researchers to conduct cross-institutional data collection and analysis with a 
certain level of security guaranteed. 